# **(ADD HERE THE NOTEBOOK NAME)**

## Objectives

* Write here your notebook objective, for example, "Fetch data from Kaggle and save as raw data", or "engineer features for modelling"

## Inputs

* Write here which data or information you need to run the notebook 

## Outputs

* Write here which files, code or artifacts you generate by the end of the notebook 

## Additional Comments

* In case you have any additional comments that don't fit in the previous bullets, please state them here. 

## CRISP-DM
* Data Understanding

---

# Change working directory

* We assume the notebooks are stored in a subfolder, so when running a notebook in the editor you need to change the working directory

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")    

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir

# Load Data

Section 1 content

In [None]:
import pandas as pd
df = pd.read_csv("inputs/datasets/unzipped/house_prices_records.csv")
print(df.shape)
df.head()

In [None]:
df.info()

In [None]:
from pandas_profiling import ProfileReport

pandas_report = ProfileReport(df=df, minimal=True)
pandas_report.to_notebook_iframe()

---

# Correlation Study

Section 2 content

In [None]:
from feature_engine.imputation import CategoricalImputer
df_corr = df.copy()
imputer = CategoricalImputer(imputation_method='frequent',
                                variables=["BsmtFinType1","GarageFinish"])
imputer.fit(df_corr)


In [None]:
df_corr =imputer.transform(df_corr)

In [None]:
df_corr.isnull().sum()


In [None]:
from feature_engine.encoding import OneHotEncoder
encoder = OneHotEncoder(variables=df_corr.columns[df_corr.dtypes=='object'].to_list(), drop_last=False)
df_ohe = encoder.fit_transform(df_corr)
print(df_ohe.shape)
df_ohe.head(3)

---

In [None]:
df_pearson= df_ohe.corr(method='pearson')['SalePrice'].sort_values(key=abs, ascending=False)[1:].head(5)
# pearson_cols=df_pearson.filter(['SalePrice'])
# pearson_cols
df_pearson


In [None]:
df_spearman= df_ohe.corr(method='spearman')['SalePrice'].sort_values(key=abs, ascending=False)[1:].head(5)
# spearman_cols=df_spearman.filter(['SalePrice']).sort_values(by='SalePrice', key=abs, ascending=False)[1:].head(5)
# spearman_cols
df_spearman

In [None]:
cols_to_study= set(df_spearman.index.to_list()+df_pearson.index.to_list())
cols_to_study

These variables correlate most closely with SalePrice
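
The selection above can be sketched on toy data. This is a minimal illustration, assuming a hypothetical helper `top_correlated` and made-up values (neither comes from the notebook or the dataset); it drops the target by name rather than by position before ranking by absolute correlation.

```python
import pandas as pd

# Toy data: hypothetical values, not taken from the house prices dataset
toy = pd.DataFrame({
    "SalePrice": [100, 150, 200, 250, 300],
    "GrLivArea": [800, 1000, 1300, 1500, 1900],
    "Age": [50, 40, 30, 20, 10],
    "Noise": [3, 1, 4, 1, 5],
})


def top_correlated(df, target, method, n=2):
    """Return the n features with the largest absolute correlation to target."""
    # Drop the target's self-correlation by label, then rank by magnitude
    corr = df.corr(method=method)[target].drop(target)
    return corr.sort_values(key=abs, ascending=False).head(n)


top_pearson = top_correlated(toy, "SalePrice", "pearson")
top_spearman = top_correlated(toy, "SalePrice", "spearman")

# Union of both rankings, as done for cols_to_study
cols = sorted(set(top_pearson.index) | set(top_spearman.index))
print(cols)
```

Taking the union of the Pearson and Spearman rankings keeps variables whose relationship with the target is monotonic but not strictly linear.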

In [None]:



# # df_eda =df_ohe.filter(['SalePrice'])
# # df_eda.head(20)

df_eda = df_ohe.filter(['1stFlrSF',
 'GarageArea',
 'GrLivArea',
 'OverallQual',
 'TotalBsmtSF',
 'YearBuilt',
 'SalePrice'])
df_eda.head()



In [None]:
from feature_engine.discretisation import EqualFrequencyDiscretiser
discretiser= EqualFrequencyDiscretiser(q=6, variables=['SalePrice'])
discretiser.fit(df_eda)
df_eda=discretiser.transform(df_eda)
df_eda




In [None]:
%matplotlib inline
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_theme(style="whitegrid")
sns.histplot(data=df_eda, x='SalePrice')
plt.show()


In [None]:
discretiser.binner_dict_

In [None]:
labels=discretiser.binner_dict_['SalePrice']
q_value= len(labels)-1
labels_map={}

for x in range(0,q_value):
    if x == 0:
        labels_map[x] = f"< {int(labels[1])}"
    elif x < q_value -1:
        labels_map[x] = f"{int(labels[x])} - {int(labels[x+1])}"
    else:
        labels_map[x] =f"{int(labels[x])} +"

labels_map
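
The label-building loop can also be wrapped in a function and checked on a small example. This is a sketch under the assumption that the bin edges start at negative infinity and end at positive infinity, as in the `binner_dict_` output shown earlier; the name `make_bin_labels` and the example edges are hypothetical.

```python
def make_bin_labels(edges):
    """Map each bin index to a readable range label.

    edges: bin boundaries, assumed to start at -inf and end at +inf.
    """
    n_bins = len(edges) - 1
    labels = {}
    for i in range(n_bins):
        if i == 0:
            # First bin: everything below the first finite edge
            labels[i] = f"< {int(edges[1])}"
        elif i < n_bins - 1:
            # Middle bins: bounded on both sides
            labels[i] = f"{int(edges[i])} - {int(edges[i + 1])}"
        else:
            # Last bin: everything from the last finite edge upwards
            labels[i] = f"{int(edges[i])} +"
    return labels


# Hypothetical edges describing three bins
example_edges = [float("-inf"), 100000.0, 200000.0, float("inf")]
example_labels = make_bin_labels(example_edges)
print(example_labels)
```

Only the finite edges are converted with `int()`, so the infinite boundaries never raise an overflow error.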

In [None]:
df_eda["SalePrice"] = df_eda["SalePrice"].replace(labels_map)
df_eda

In [None]:
hue_order = list(labels_map.values())
hue_order

Plotting functions adapted from: Exploratory Data Analysis, Exploratory Data Analysis Tools, Unit 2: Correlation Analysis

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')


def plot_categorical(df, col, target_var):

  plt.figure(figsize=(12, 5))
  sns.countplot(data=df, x=col, hue=target_var,order = df[col].value_counts().index)
  plt.xticks(rotation=90) 
  plt.title(f"{col}", fontsize=20,y=1.05)        
  plt.show()

def plot_numerical(df, col, target_var,hue_order):
  plt.figure(figsize=(8, 5))
  sns.histplot(data=df, x=col, hue=target_var, hue_order=hue_order,kde=True,element="step") 
  plt.title(f"{col}", fontsize=20,y=1.05)
  plt.show()



target_var = 'SalePrice'
for col in ['1stFlrSF',
 'GarageArea',
 'GrLivArea',
 'OverallQual',
 'TotalBsmtSF',
 'YearBuilt']:
  if df_eda[col].dtype == 'object':
    plot_categorical(df_eda, col, target_var)
    print("\n\n")
  else:
    plot_numerical(df_eda, col, target_var, hue_order)
    print("\n\n")


## Conclusions
* Houses with high sale prices tend to have a first floor of at least 1,500 square feet.

* Houses with low sale prices tend to have no garage, while those with a garage of at least 600 square feet tend to have high sale prices.

* Houses with high sale prices tend to have an above-grade living area of at least 1,500 square feet, while those with low sale prices tend to have 1,000 square feet or less.

* Houses with high sale prices tend to have an overall quality rating of at least Very Good.

* Houses with high sale prices tend to have a basement of at least 1,200 square feet. Houses with no basement, or a basement smaller than 1,000 square feet, tend to have low sale prices.

* Houses built before 1990 tend not to have a high sale price.
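
Observations like these are read off the plots, but they can be cross-checked numerically by comparing a feature's median across price bins. The snippet below is a minimal sketch on hypothetical rows; the bin labels and values are illustrative assumptions, not results from the dataset.

```python
import pandas as pd

# Hypothetical rows: labels and values are illustrative only
sample = pd.DataFrame({
    "SalePrice": ["< 130000", "< 130000", "250000 +", "250000 +"],
    "GrLivArea": [900, 1000, 1800, 2200],
})

# Median living area within each price bin
medians = sample.groupby("SalePrice")["GrLivArea"].median()
print(medians)
```

Applied to `df_eda`, the same `groupby` call would give one median per price bin for any of the studied variables.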


NOTE

* You may add as many sections as you want, as long as they support your project workflow.
* All notebook cells should be run top-down (you can't create a dynamic wherein at a given point you need to go back to a previous cell to execute some task, such as refreshing a variable's content)

---

# Push files to Repo

* In case you don't need to push files to Repo, you may replace this section with "Conclusions and Next Steps" and state your conclusions and next steps.

In [None]:
import os
try:
  # create here your folder
  # os.makedirs(name='')
  pass  # placeholder so the try block is valid until a folder is created
except Exception as e:
  print(e)
