# **Data Collection Notebook**

## Objectives
- Fetch data from Kaggle and save it as raw data.
- Inspect the data and save it under outputs.

## Inputs
- Kaggle JSON file - the authentication token.

## Outputs
- Generated dataset: outputs/data_collected/house_pricing_data.csv

## Additional Comments / Conclusions
- The data is provided by Code Institute as training data for project 5.
- The following columns do not have a numeric type (dtype='object'): ['BsmtExposure', 'BsmtFinType1', 'GarageFinish', 'KitchenQual']
- The following columns have missing values: ['EnclosedPorch', 'WoodDeckSF', 'LotFrontage', 'GarageFinish', 'BsmtFinType1', 'BedroomAbvGr', '2ndFlrSF', 'GarageYrBlt', 'BsmtExposure', 'MasVnrArea']
- The following columns contain the value zero. All of these values should remain, since a zero indicates that the relevant attribute is not present: ['MasVnrArea', '2ndFlrSF', 'OpenPorchSF', 'BsmtFinSF1', 'BsmtUnfSF', 'EnclosedPorch', 'GarageArea', 'WoodDeckSF', 'TotalBsmtSF', 'BedroomAbvGr']

---

## Change working directory

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

'/workspaces/ci-c5-housing-market-prices/jupyter_notebooks'

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


In [3]:
current_dir = os.getcwd()
current_dir

'/workspaces/ci-c5-housing-market-prices'
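Running the `os.chdir` cell twice moves the working directory up two levels. A minimal sketch of a guard against that is shown here, assuming the notebooks live in a folder called `jupyter_notebooks` (taken from the path printed above); the helper name `move_to_project_root` is hypothetical and not part of the original notebook.

```python
import os
from pathlib import Path


def move_to_project_root(notebook_dir_name: str = "jupyter_notebooks") -> Path:
    """Move to the parent folder only while still inside the notebooks folder.

    `notebook_dir_name` is an assumption based on the path shown above.
    """
    cwd = Path.cwd()
    if cwd.name == notebook_dir_name:
        os.chdir(cwd.parent)
    return Path.cwd()


# Safe to call repeatedly: the second call leaves the directory unchanged.
print(move_to_project_root())
```

Because the check is on the folder name, re-running the cell after a kernel restart or by accident has no further effect.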

---

## Install the following Python packages in the notebook

In [5]:
%pip install -r requirements.txt

Collecting numpy==1.26.1 (from -r requirements.txt (line 1))
  Downloading numpy-1.26.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (61 kB)
Collecting pandas==2.1.1 (from -r requirements.txt (line 2))
  Downloading pandas-2.1.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (18 kB)
Collecting matplotlib==3.8.0 (from -r requirements.txt (line 3))
  Downloading matplotlib-3.8.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.8 kB)
Collecting seaborn==0.13.2 (from -r requirements.txt (line 4))
  Downloading seaborn-0.13.2-py3-none-any.whl.metadata (5.4 kB)
Collecting ydata-profiling==4.12.0 (from -r requirements.txt (line 5))
  Downloading ydata_profiling-4.12.0-py2.py3-none-any.whl.metadata (20 kB)
Collecting plotly==5.17.0 (from -r requirements.txt (line 6))
  Downloading plotly-5.17.0-py2.py3-none-any.whl.metadata (7.0 kB)
Collecting ppscore==1.1.0 (from -r requirements.txt (line 7))
  Downloading ppscore-1.1.0.tar.gz 

---

## Get data

The data is provided by Kaggle. The file was downloaded, unzipped, and manually added to the project via drag and drop.

In [6]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()

---

## Load and inspect Kaggle data

In [8]:
%pip uninstall pandas -y
%pip install pandas


Found existing installation: pandas 2.1.1
Uninstalling pandas-2.1.1:
  Successfully uninstalled pandas-2.1.1
Note: you may need to restart the kernel to use updated packages.
Collecting pandas
  Downloading pandas-2.2.3-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (89 kB)
Downloading pandas-2.2.3-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.7 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.7/12.7 MB 75.2 MB/s eta 0:00:00
Installing collected packages: pandas
Successfully installed pandas-2.2.3

[notice] A new release of pip is available: 24.3.1 -> 25.0.1
[notice] To update, run: pip install --upgrade pip
Note: you may need to restart the kernel to use updated packages.


In [4]:
import pandas as pd
df = pd.read_csv(r"inputs/house_prices_records.csv")
df.head()

# Suggested fixes for the file path (\\, r"", /) were tried; none of them worked.

Unnamed: 0,1stFlrSF,2ndFlrSF,BedroomAbvGr,BsmtExposure,BsmtFinSF1,BsmtFinType1,BsmtUnfSF,EnclosedPorch,GarageArea,GarageFinish,...,LotFrontage,MasVnrArea,OpenPorchSF,OverallCond,OverallQual,TotalBsmtSF,WoodDeckSF,YearBuilt,YearRemodAdd,SalePrice
0,856,854.0,3.0,No,706,GLQ,150,0.0,548,RFn,...,65.0,196.0,61,5,7,856,0.0,2003,2003,208500
1,1262,0.0,3.0,Gd,978,ALQ,284,,460,RFn,...,80.0,0.0,0,8,6,1262,,1976,1976,181500
2,920,866.0,3.0,Mn,486,GLQ,434,0.0,608,RFn,...,68.0,162.0,42,5,7,920,,2001,2002,223500
3,961,,,No,216,ALQ,540,,642,Unf,...,60.0,0.0,35,5,7,756,,1915,1970,140000
4,1145,,4.0,Av,655,GLQ,490,0.0,836,RFn,...,84.0,350.0,84,5,8,1145,,2000,2000,250000


### Check data types and identify all non-numeric values:

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 24 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   1stFlrSF       1460 non-null   int64  
 1   2ndFlrSF       1374 non-null   float64
 2   BedroomAbvGr   1361 non-null   float64
 3   BsmtExposure   1422 non-null   object 
 4   BsmtFinSF1     1460 non-null   int64  
 5   BsmtFinType1   1315 non-null   object 
 6   BsmtUnfSF      1460 non-null   int64  
 7   EnclosedPorch  136 non-null    float64
 8   GarageArea     1460 non-null   int64  
 9   GarageFinish   1225 non-null   object 
 10  GarageYrBlt    1379 non-null   float64
 11  GrLivArea      1460 non-null   int64  
 12  KitchenQual    1460 non-null   object 
 13  LotArea        1460 non-null   int64  
 14  LotFrontage    1201 non-null   float64
 15  MasVnrArea     1452 non-null   float64
 16  OpenPorchSF    1460 non-null   int64  
 17  OverallCond    1460 non-null   int64  
 18  OverallQ

In [None]:
non_numeric_columns = df.select_dtypes(include=['object']).columns
print("\nNon-Numeric Columns:")
print(non_numeric_columns)


Non-Numeric Columns:
Index(['BsmtExposure', 'BsmtFinType1', 'GarageFinish', 'KitchenQual'], dtype='object')
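Once the non-numeric columns are known, it can help to look at the category levels each one holds. The sketch below runs on a small made-up frame, not the real dataset; only the column names are borrowed from the output above. It selects columns with `exclude=["number"]` rather than `include=["object"]`, which is a swapped-in variant that does not depend on how a given pandas version labels string columns.

```python
import pandas as pd

# Toy data for illustration only; the values are invented.
toy = pd.DataFrame({
    "KitchenQual": ["Gd", "TA", "Gd", "Ex"],
    "GarageFinish": ["RFn", "Unf", "Unf", "RFn"],
    "GrLivArea": [1710, 1262, 1786, 1717],
})

# Everything that is not numeric is treated as categorical here.
categorical_columns = toy.select_dtypes(exclude=["number"]).columns.tolist()

# Frequency of each level per categorical column.
level_counts = {
    col: toy[col].value_counts().to_dict() for col in categorical_columns
}

print(categorical_columns)
print(level_counts)
```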


### Identifying missing values

Total number of missing values in the entire dataset (counting each 'True' as 1)

In [9]:
# Boolean mask: True wherever a value is missing
missing_values = df.isnull()
total_missing_values = missing_values.sum().sum()
print(f"\nTotal Missing Values in the Dataset: {total_missing_values}")


Total Missing Values in the Dataset: 3580


All columns with missing data

In [10]:
# Count missing values per column
missing_count_per_column = df.isnull().sum()

# Filter and sort columns with missing values
missing_columns = missing_count_per_column[missing_count_per_column > 0].sort_values(ascending=False)

print("\nColumns with Missing Values and Their Counts:")
print(missing_columns)


Columns with Missing Values and Their Counts:
EnclosedPorch    1324
WoodDeckSF       1305
LotFrontage       259
GarageFinish      235
BsmtFinType1      145
BedroomAbvGr       99
2ndFlrSF           86
GarageYrBlt        81
BsmtExposure       38
MasVnrArea          8
dtype: int64
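Absolute counts are easier to judge when expressed as a share of all rows. A minimal sketch on a made-up frame (the values are invented; only the column names come from the dataset) shows one way to get the percentage of missing values per column:

```python
import numpy as np
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "EnclosedPorch": [0.0, np.nan, np.nan, np.nan],
    "LotFrontage": [65.0, 80.0, np.nan, 60.0],
    "GrLivArea": [1710, 1262, 1786, 1717],
})

# The mean of a boolean mask is the fraction of True values per column.
missing_share = toy.isnull().mean().mul(100).round(1)

# Keep only columns that actually have missing data, largest share first.
missing_share = missing_share[missing_share > 0].sort_values(ascending=False)

print(missing_share)
```

Applied to the real `df`, the same two lines would turn the counts above into percentages of the 1460 rows.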


### Understanding missing values:

#### Columns with 0 missing values:

    1stFlrSF, BsmtFinSF1, BsmtUnfSF, GarageArea, GrLivArea, KitchenQual, LotArea, OpenPorchSF, OverallCond, OverallQual, TotalBsmtSF, YearBuilt, YearRemodAdd, SalePrice
    
These columns don't require any action since they have no missing values.

#### Columns with a substantial number of missing values:

    EnclosedPorch (1324 missing), WoodDeckSF (1305 missing)
    
These columns are missing in roughly 90% of the 1460 rows (1324 and 1305 missing values, respectively). Dropping them from the dataset is likely the better option, because imputation based on so few observed values would be unreliable and the columns may contribute little to the analysis or model.

#### Columns with moderate missing values:

    2ndFlrSF (86 missing), BedroomAbvGr (99 missing), BsmtExposure (38 missing), BsmtFinType1 (145 missing), GarageFinish (235 missing), GarageYrBlt (81 missing), LotFrontage (259 missing), MasVnrArea (8 missing)
    
These columns have a moderate share of missing data (at most 259 of 1460 rows, about 18%). Depending on their relevance to the analysis, they can be imputed or, in some cases, dropped.
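The two strategies described above can be sketched on a made-up frame. This is only an illustration under stated assumptions, not the cleaning step used in this project: the cut-off `DROP_THRESHOLD` is a hypothetical value, and median (numeric) and most-frequent-value (categorical) imputation are just one common choice.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; the values are invented.
toy = pd.DataFrame({
    "EnclosedPorch": [0.0, np.nan, np.nan, np.nan, np.nan],
    "LotFrontage": [65.0, 80.0, np.nan, 60.0, 84.0],
    "GarageFinish": ["RFn", "RFn", None, "Unf", "RFn"],
})

DROP_THRESHOLD = 0.75  # hypothetical cut-off, not taken from the notebook

# 1) Drop columns whose share of missing values exceeds the cut-off.
missing_share = toy.isnull().mean()
to_drop = missing_share[missing_share > DROP_THRESHOLD].index.tolist()
cleaned = toy.drop(columns=to_drop)

# 2) Impute the rest: median for numeric columns, most frequent value otherwise.
for col in cleaned.columns:
    if pd.api.types.is_numeric_dtype(cleaned[col]):
        cleaned[col] = cleaned[col].fillna(cleaned[col].median())
    else:
        cleaned[col] = cleaned[col].fillna(cleaned[col].mode().iloc[0])

print(to_drop)
print(cleaned)
```

Whether a column is dropped or imputed should still be decided per variable, since a missing value can itself carry information (for example, no garage).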

### Identifying value 'zero'

Total count of zero values in the entire DataFrame

In [11]:
total_zeros = (df == 0).sum().sum()
print(f"Total Number of Zero Values in the Dataset: {total_zeros}")

Total Number of Zero Values in the Dataset: 3201


Columns with Zero Values and Their Counts

In [12]:
# Count zero values per column
zero_count_per_column = (df == 0).sum()

# Filter columns that have at least one zero value
columns_with_zeros = zero_count_per_column[zero_count_per_column > 0].sort_values(ascending=False)

print("\nColumns with Zero Values and Their Counts:")
print(columns_with_zeros)


Columns with Zero Values and Their Counts:
MasVnrArea       861
2ndFlrSF         781
OpenPorchSF      656
BsmtFinSF1       467
BsmtUnfSF        118
EnclosedPorch    116
GarageArea        81
WoodDeckSF        78
TotalBsmtSF       37
BedroomAbvGr       6
dtype: int64


### Understanding value 'zero' for relevant variables

All of the following variables are measured in square feet. A zero indicates that the house does not have the specific attribute:
'MasVnrArea', '2ndFlrSF', 'OpenPorchSF', 'BsmtFinSF1', 'BsmtUnfSF', 'EnclosedPorch', 'GarageArea', 'WoodDeckSF', 'TotalBsmtSF'


'BedroomAbvGr' means 'bedrooms above grade' and does NOT include basement bedrooms. Houses with the value zero do NOT have a bedroom above grade.

No data adjustment needed.
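Zeros and missing values are separate things: a comparison such as `df == 0` is False for NaN, so the two counts never overlap. A small sketch on invented data (column names borrowed from the dataset) puts both counts side by side:

```python
import numpy as np
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "2ndFlrSF": [854.0, 0.0, np.nan, 0.0],
    "GarageArea": [548, 0, 608, 642],
})

# NaN == 0 evaluates to False, so missing entries are not counted as zeros.
summary = pd.DataFrame({
    "zeros": (toy == 0).sum(),
    "missing": toy.isnull().sum(),
})

print(summary)
```

A table like this makes it easy to check that "attribute not present" (zero) and "value not recorded" (missing) are kept apart.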
 

---

## Push files to Repo

In [13]:
import os
try:
  os.makedirs(name='outputs/data_collected')  # create the data_collected folder
except Exception as e:
  print(e)

[Errno 17] File exists: 'outputs/data_collected'


In [14]:
df.to_csv("outputs/data_collected/house_pricing_data.csv", index=False)
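The `[Errno 17] File exists` message above appears whenever the folder is already there. As an alternative sketch, `os.makedirs` accepts `exist_ok=True`, which makes the call safe to repeat without a `try`/`except`. The example writes a made-up frame into a temporary directory so that it does not touch the project's `outputs` folder; the data values are invented.

```python
import os
import tempfile

import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({"GrLivArea": [1710, 1262], "SalePrice": [208500, 181500]})

with tempfile.TemporaryDirectory() as base_dir:
    out_dir = os.path.join(base_dir, "outputs", "data_collected")

    # exist_ok=True: no error if the folder already exists.
    os.makedirs(out_dir, exist_ok=True)
    os.makedirs(out_dir, exist_ok=True)  # repeating the call is harmless

    out_path = os.path.join(out_dir, "house_pricing_data.csv")
    toy.to_csv(out_path, index=False)

    # Read the file back to confirm the round trip preserved the data.
    reloaded = pd.read_csv(out_path)

print(reloaded.equals(toy))
```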