# **Data Collection Notebook**

## Objectives
- Fetch data from Kaggle and save it as raw data.
- Inspect the data and save it under outputs.

## Inputs
- Kaggle JSON file - the authentication token.

## Outputs
- Generated dataset: outputs/data_collected/house_pricing_data.csv

## Additional Comments / Conclusions
- The data is provided by Code Institute as training data for this project (project 5).
- The following columns do not have a numeric type (dtype='object'): ['BsmtExposure', 'BsmtFinType1', 'GarageFinish', 'KitchenQual']
- The following columns have the most missing values: ['EnclosedPorch', 'WoodDeckSF', 'LotFrontage', 'GarageFinish', 'BsmtFinType1']
- The columns with the highest correlation with SalePrice are 'OverallQual' and 'GrLivArea'.

---

## Install the following Python packages in the notebook

numpy==1.26.1
pandas==2.1.1
matplotlib==3.8.0
seaborn==0.13.2
ydata-profiling==4.12.0 # This package can be removed prior to deployment
plotly==5.17.0
ppscore==1.1.0 # This package can be removed prior to deployment
streamlit==1.40.2
feature-engine==1.6.1
imbalanced-learn==0.11.0
scikit-learn==1.3.1
xgboost==1.7.6
yellowbrick==1.5 # This package can be removed prior to deployment
Pillow==10.0.1 # This package can be removed prior to deployment
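
To confirm that the environment matches the pins above, a small check along these lines can help. It is only a sketch: the `pinned` dictionary is a hand-copied subset of the list and `check_pins` is a hypothetical helper, not part of the project.

```python
from importlib.metadata import PackageNotFoundError, version

# Hand-copied subset of the pinned packages above (extend as needed).
pinned = {"numpy": "1.26.1", "pandas": "2.1.1", "scikit-learn": "1.3.1"}


def check_pins(pins):
    """Return {package: (wanted, installed, matches)} for each pinned package."""
    report = {}
    for name, wanted in pins.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None  # package is not installed in this environment
        report[name] = (wanted, installed, installed == wanted)
    return report


for name, (wanted, installed, ok) in check_pins(pinned).items():
    status = "OK" if ok else "MISMATCH"
    print(f"{name}: wanted {wanted}, installed {installed} -> {status}")
```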

---

## Change working directory

We need to change the working directory from the notebook's folder (jupyter_notebooks) to its parent folder, the project root.
* We access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

'/workspaces/ci-c5-housing-market-prices/jupyter_notebooks'

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


In [3]:
current_dir = os.getcwd()
current_dir

'/workspaces/ci-c5-housing-market-prices'

---

## Get data

Data is provided by Kaggle. The file was downloaded, unzipped, and manually added to the project via drag and drop.

In [31]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()

---

## Load and inspect Kaggle data

In [4]:
import pandas as pd
df = pd.read_csv(r"inputs/house_prices_records.csv")
df.head()

# Suggested fixes for the file path (escaped backslashes \\, a raw string r"", forward slashes /) were tried, but none of them worked

Unnamed: 0,1stFlrSF,2ndFlrSF,BedroomAbvGr,BsmtExposure,BsmtFinSF1,BsmtFinType1,BsmtUnfSF,EnclosedPorch,GarageArea,GarageFinish,...,LotFrontage,MasVnrArea,OpenPorchSF,OverallCond,OverallQual,TotalBsmtSF,WoodDeckSF,YearBuilt,YearRemodAdd,SalePrice
0,856,854.0,3.0,No,706,GLQ,150,0.0,548,RFn,...,65.0,196.0,61,5,7,856,0.0,2003,2003,208500
1,1262,0.0,3.0,Gd,978,ALQ,284,,460,RFn,...,80.0,0.0,0,8,6,1262,,1976,1976,181500
2,920,866.0,3.0,Mn,486,GLQ,434,0.0,608,RFn,...,68.0,162.0,42,5,7,920,,2001,2002,223500
3,961,,,No,216,ALQ,540,,642,Unf,...,60.0,0.0,35,5,7,756,,1915,1970,140000
4,1145,,4.0,Av,655,GLQ,490,0.0,836,RFn,...,84.0,350.0,84,5,8,1145,,2000,2000,250000
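
Path problems when loading a file often come down to the working directory rather than the slash style. As a hedged sketch, the hypothetical helper `resolve_input` below builds the path with `pathlib` relative to the current working directory and fails with a clear message if the file is not there. It assumes the working directory has already been set to the project root, as done earlier in this notebook.

```python
from pathlib import Path
from typing import Optional


def resolve_input(relative_path: str, base_dir: Optional[Path] = None) -> Path:
    """Resolve a data file relative to base_dir (default: current working directory)."""
    base = base_dir if base_dir is not None else Path.cwd()
    candidate = (base / relative_path).resolve()
    if not candidate.is_file():
        raise FileNotFoundError(f"Expected data file at {candidate}")
    return candidate


try:
    csv_path = resolve_input("inputs/house_prices_records.csv")
    print(f"Found {csv_path}")
except FileNotFoundError as err:
    # Printed when the notebook is run from a folder that lacks the inputs directory.
    print(err)
```

The returned path can be passed directly to `pd.read_csv`.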


## Check data types and identify all non-numeric columns:

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 24 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   1stFlrSF       1460 non-null   int64  
 1   2ndFlrSF       1374 non-null   float64
 2   BedroomAbvGr   1361 non-null   float64
 3   BsmtExposure   1422 non-null   object 
 4   BsmtFinSF1     1460 non-null   int64  
 5   BsmtFinType1   1315 non-null   object 
 6   BsmtUnfSF      1460 non-null   int64  
 7   EnclosedPorch  136 non-null    float64
 8   GarageArea     1460 non-null   int64  
 9   GarageFinish   1225 non-null   object 
 10  GarageYrBlt    1379 non-null   float64
 11  GrLivArea      1460 non-null   int64  
 12  KitchenQual    1460 non-null   object 
 13  LotArea        1460 non-null   int64  
 14  LotFrontage    1201 non-null   float64
 15  MasVnrArea     1452 non-null   float64
 16  OpenPorchSF    1460 non-null   int64  
 17  OverallCond    1460 non-null   int64  
 18  OverallQ

In [11]:
non_numeric_columns = df.select_dtypes(include=['object']).columns
print("\nNon-Numeric Columns:")
print(non_numeric_columns)


Non-Numeric Columns:
Index(['BsmtExposure', 'BsmtFinType1', 'GarageFinish', 'KitchenQual'], dtype='object')
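
Before encoding these four columns, it can help to see which categories they contain and how often each occurs. The sketch below uses a small toy frame with illustrative values, not the real dataset; on the notebook's `df` the same loop would run over the non-numeric columns found above.

```python
import pandas as pd

# Toy frame standing in for the non-numeric columns; values are illustrative only.
toy = pd.DataFrame({
    "BsmtExposure": ["No", "Gd", "Mn", "No", None],
    "KitchenQual": ["Gd", "TA", "Gd", "Gd", "Ex"],
    "SalePrice": [208500, 181500, 223500, 140000, 250000],
})

# Count each category per non-numeric column, keeping missing values visible.
category_counts = {
    col: toy[col].value_counts(dropna=False)
    for col in toy.select_dtypes(exclude=["number"]).columns
}

for col, counts in category_counts.items():
    print(f"\n{col}:")
    print(counts)
```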


## Get to know the data based on Sales Price. Provide average, median, highest and lowest values

In [12]:
# Calculate the average (mean) of the 'SalePrice' column
average_sales_price = df['SalePrice'].mean()
print(f"Average Sales Price: {average_sales_price}")

# Calculate the median of the 'SalePrice' column
median_sales_price = df['SalePrice'].median()
print(f"Median Sales Price: {median_sales_price}")

# Calculate the highest and lowest values of the 'SalePrice' column
highest_sales_price = df['SalePrice'].max()
lowest_sales_price = df['SalePrice'].min()
print(f"Highest Sales Price: {highest_sales_price}")
print(f"Lowest Sales Price: {lowest_sales_price}")

# Calculate the Interquartile Range (IQR) to identify outliers
Q1 = df['SalePrice'].quantile(0.25)
Q3 = df['SalePrice'].quantile(0.75)
IQR = Q3 - Q1

# Define outliers as values outside of 1.5 * IQR above Q3 or below Q1
outliers = df[(df['SalePrice'] < (Q1 - 1.5 * IQR)) | (df['SalePrice'] > (Q3 + 1.5 * IQR))]
print(f"Outliers in Sales Price:\n{outliers}")

# Display the summary statistics for 'SalePrice'
summary_stats = df['SalePrice'].describe()
print(f"Summary Statistics:\n{summary_stats}")

Average Sales Price: 180921.19589041095
Median Sales Price: 163000.0
Highest Sales Price: 755000
Lowest Sales Price: 34900
Outliers in Sales Price:
      1stFlrSF  2ndFlrSF  BedroomAbvGr BsmtExposure  BsmtFinSF1 BsmtFinType1  \
11        1182    1142.0           4.0           No         998          NaN   
53        1842       0.0           0.0           Gd        1810          GLQ   
58        1426    1519.0           3.0           Gd           0          Unf   
112       1282    1414.0           4.0           Av         984          GLQ   
151       1710       0.0           2.0           Gd        1400          NaN   
...        ...       ...           ...          ...         ...          ...   
1268      1968    1479.0           4.0           Mn         192          Rec   
1353      2053    1185.0           4.0           Av         816          NaN   
1373      2633       0.0           2.0           Gd        1282          GLQ   
1388      1746       0.0           3.0           Gd 

### Identifying missing values

In [13]:
# Check for missing values in the entire dataset
missing_values = df.isnull()

# Count the number of missing values per column
missing_count_per_column = missing_values.sum()
print("\nNumber of Missing Values Per Column:")
print(missing_count_per_column)


Number of Missing Values Per Column:
1stFlrSF            0
2ndFlrSF           86
BedroomAbvGr       99
BsmtExposure       38
BsmtFinSF1          0
BsmtFinType1      145
BsmtUnfSF           0
EnclosedPorch    1324
GarageArea          0
GarageFinish      235
GarageYrBlt        81
GrLivArea           0
KitchenQual         0
LotArea             0
LotFrontage       259
MasVnrArea          8
OpenPorchSF         0
OverallCond         0
OverallQual         0
TotalBsmtSF         0
WoodDeckSF       1305
YearBuilt           0
YearRemodAdd        0
SalePrice           0
dtype: int64


Total number of missing values in the entire dataset (counting each 'True' as 1)

In [14]:
total_missing_values = missing_values.sum().sum()
print(f"\nTotal Missing Values in the Dataset: {total_missing_values}")


Total Missing Values in the Dataset: 3580


Get the top 5 columns with the most missing values

In [15]:
top_5_missing_columns = missing_count_per_column.sort_values(ascending=False).head(5)
print("\nTop 5 Columns with Most Missing Values:")
print(top_5_missing_columns)


Top 5 Columns with Most Missing Values:
EnclosedPorch    1324
WoodDeckSF       1305
LotFrontage       259
GarageFinish      235
BsmtFinType1      145
dtype: int64
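
Absolute counts are easier to judge as a share of all rows. A minimal sketch, using a hypothetical helper `missing_share` and a toy frame with made-up values:

```python
import pandas as pd


def missing_share(frame: pd.DataFrame) -> pd.Series:
    """Return the percentage of missing values per column, largest first."""
    return (frame.isnull().mean() * 100).sort_values(ascending=False)


# Toy data; the values are illustrative and not taken from the real dataset.
toy = pd.DataFrame({
    "EnclosedPorch": [0.0, None, None, None],
    "LotFrontage": [65.0, 80.0, None, 60.0],
    "SalePrice": [208500, 181500, 223500, 140000],
})

shares = missing_share(toy)
print(shares.round(1))
```

Applied to the notebook's `df`, the same calculation gives, for example, 259 / 1460 ≈ 17.7% for LotFrontage.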


### Understanding the missing values:

#### Columns with 0 missing values:

    1stFlrSF, BsmtFinSF1, BsmtUnfSF, GarageArea, GrLivArea, KitchenQual, LotArea, OpenPorchSF, OverallCond, OverallQual, TotalBsmtSF, YearBuilt, YearRemodAdd, SalePrice
    
These columns don't require any action since they have no missing values.

#### Columns with a substantial number of missing values:

    EnclosedPorch (1324 missing), WoodDeckSF (1305 missing)
    
These columns are missing in roughly 90% of the 1460 rows, so they are close to empty. Dropping them from the dataset is probably the better option, because imputing that many values would not be reliable and the columns may not contribute much to the analysis or model.
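
One way to make this decision reproducible is to drop every column whose missing share exceeds a threshold. The helper `drop_sparse_columns` and the 0.8 cut-off below are assumptions for illustration, not values taken from this project.

```python
import pandas as pd


def drop_sparse_columns(frame: pd.DataFrame, max_missing_share: float = 0.8):
    """Drop columns whose share of missing values exceeds max_missing_share."""
    missing = frame.isnull().mean()
    to_drop = missing[missing > max_missing_share].index.tolist()
    return frame.drop(columns=to_drop), to_drop


# Toy data: EnclosedPorch is 90% missing, LotFrontage 20% (illustrative values).
toy = pd.DataFrame({
    "EnclosedPorch": [0.0] + [None] * 9,
    "LotFrontage": [65.0, 80.0, None, 60.0, 84.0, 85.0, None, 75.0, 51.0, 50.0],
})

reduced, dropped = drop_sparse_columns(toy)
print(f"Dropped columns: {dropped}")
print(f"Remaining columns: {list(reduced.columns)}")
```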

#### Columns with moderate missing values:

    2ndFlrSF (86 missing), BedroomAbvGr (99 missing), BsmtExposure (38 missing), BsmtFinType1 (145 missing), GarageFinish (235 missing), GarageYrBlt (81 missing), LotFrontage (259 missing), MasVnrArea (8 missing)
    
These columns have moderate amounts of missing data. You should consider imputation or, in some cases, dropping them, depending on their relevance to the analysis.
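
As a rough illustration of simple imputation, the sketch below fills numeric columns with the median and non-numeric columns with the most frequent value. It uses plain pandas on toy data. The choice of median and mode is an assumption, not the strategy adopted later in this project.

```python
import pandas as pd

# Toy data with one numeric and one categorical column (illustrative values).
toy = pd.DataFrame({
    "LotFrontage": [65.0, 80.0, None, 60.0],
    "GarageFinish": ["RFn", "RFn", "Unf", None],
})

imputed = toy.copy()
for col in imputed.columns:
    if pd.api.types.is_numeric_dtype(imputed[col]):
        # Numeric column: fill gaps with the median of the observed values.
        imputed[col] = imputed[col].fillna(imputed[col].median())
    else:
        # Non-numeric column: fill gaps with the most frequent category.
        imputed[col] = imputed[col].fillna(imputed[col].mode().iloc[0])

print(imputed)
```

When a model is trained afterwards, the fill values should be computed on the training split only and then reused for the test split.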

### Results of the correlation analysis

According to the results of the correlation analysis, the following parameters correlate most strongly with the sale price:

#### High Correlation: Correlation coefficient $|r| > 0.7$

- OverallQual      0.790982
- GrLivArea        0.708624


#### Medium Correlation: Correlation coefficient $|r| \in [0.3, 0.7]$

- KitchenQual      0.659600
- GarageArea       0.623431
- TotalBsmtSF      0.613581
- 1stFlrSF         0.605852
- YearBuilt        0.522897
- GarageFinish     0.510537
- YearRemodAdd     0.507101
- GarageYrBlt      0.486362
- MasVnrArea       0.477493
- BsmtFinSF1       0.386420
- LotFrontage      0.351799
- 2ndFlrSF         0.322335
- OpenPorchSF      0.315856


#### Low Correlation: Correlation coefficient $|r| < 0.3$

- BsmtFinType1     0.275054
- LotArea          0.263843
- WoodDeckSF       0.252027
- BsmtUnfSF        0.214479
- BedroomAbvGr     0.161901
- BsmtExposure     0.106263
- OverallCond     -0.077856
- EnclosedPorch   -0.176458
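
The cell that produced these coefficients is not shown in this notebook. The sketch below illustrates one way such a ranking could be computed. It assumes Pearson correlation and an ordinal encoding for categorical columns. The `quality_order` mapping, the `strength` helper, and the toy values are all assumptions for illustration.

```python
import pandas as pd

# Toy data with made-up values; not taken from the real dataset.
toy = pd.DataFrame({
    "OverallQual": [5, 6, 7, 8, 9],
    "GrLivArea": [1000, 1500, 1400, 2000, 2400],
    "KitchenQual": ["TA", "TA", "Gd", "Gd", "Ex"],
    "SalePrice": [120000, 150000, 180000, 240000, 320000],
})

# Assumed ordinal encoding so the categorical column can enter the correlation.
quality_order = {"Po": 0, "Fa": 1, "TA": 2, "Gd": 3, "Ex": 4}
encoded = toy.assign(KitchenQual=toy["KitchenQual"].map(quality_order))

# Pearson correlation of every feature with SalePrice, strongest first.
correlations = (
    encoded.corr(method="pearson", numeric_only=True)["SalePrice"]
    .drop("SalePrice")
    .sort_values(key=lambda s: s.abs(), ascending=False)
)


def strength(r: float) -> str:
    """Bucket a coefficient using the thresholds from the headings above."""
    magnitude = abs(r)
    if magnitude > 0.7:
        return "high"
    if magnitude >= 0.3:
        return "medium"
    return "low"


for feature, r in correlations.items():
    print(f"{feature:<12} {r:+.3f}  {strength(r)}")
```

A correlation coefficient only measures linear association, so a high value does not by itself show that a feature drives the sale price.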

---

## Push files to Repo

In [16]:
import os
# Create the outputs/data_collected folder; exist_ok avoids an error if it already exists
os.makedirs(name='outputs/data_collected', exist_ok=True)

In [17]:
df.to_csv("outputs/data_collected/house_pricing_data.csv", index=False)