# Data Science Project


**College/University Name**: _CICCC - Cornerstone International Community College of Canada_  
**Course**: _Machine Learning_  
**Instructor**: _Austin Egbal_  
**Student Name**: _Amir Lima Oliveira_  
**Submission Date**: _2025-08-dd_  

---

### Project Title
_Housing Prices Competition for Kaggle Learn Users_

---

#### Objective
* Build an end-to-end pipeline with:
    - EDA (Exploratory Data Analysis)
    - Processing
    - Modeling
    - Evaluation
    - Conclusion
---

#### Dataset Overview
- **Source:** [Housing Prices Competition for Kaggle Learn Users](https://www.kaggle.com/competitions/home-data-for-ml-course/data)
- **Description:** The Ames, Iowa housing dataset, with 79 explanatory features (81 columns once `Id` and `SalePrice` are counted) describing property size, quality, location, and amenities. The target variable is `SalePrice`, the final sale price in dollars.
- **Credits:** DanB. Housing Prices Competition for Kaggle Learn Users. https://kaggle.com/competitions/home-data-for-ml-course, 2018. Kaggle.

---

## Table of Contents
### 1. [Import Libraries](#import-libraries)  


In [18]:
# Standard library
import os
import zipfile

# Packages for data manipulation
import pandas as pd
import numpy as np

# Packages for data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# APIs for data access
import kaggle
from kaggle.api.kaggle_api_extended import KaggleApi

# Packages for machine learning
import sklearn as sk

# Packages for data preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, MinMaxScaler

# Data splitting, model training, evaluation
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA

# ML Algorithms
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import LinearRegression, SGDRegressor, SGDClassifier

# Data Evaluation
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import cross_val_score



### 2. [Load & Inspect Data](#load--inspect-data)

In [20]:
# Authenticate and download dataset from Kaggle
api = KaggleApi()
api.authenticate()

# Create a directory to store the dataset
extract_to = './data'
os.makedirs(extract_to, exist_ok=True)
zip_path = os.path.join(extract_to, 'home-data-for-ml-course.zip')

# Download the competition files with the authenticated client created above
api.competition_download_files('home-data-for-ml-course', path=extract_to)

# Extract the downloaded zip file
with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    zip_ref.extractall(extract_to)

# Load the training set
df = pd.read_csv(os.path.join(extract_to, 'train.csv'))

   - [Shape](#shape)

In [21]:
df.shape

(1460, 81)

   - [Missing Values](#missing-values)

In [23]:
df.isnull().sum()

Id                 0
MSSubClass         0
MSZoning           0
LotFrontage      259
LotArea            0
                ... 
MoSold             0
YrSold             0
SaleType           0
SaleCondition      0
SalePrice          0
Length: 81, dtype: int64
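
The truncated output above hides most of the 81 columns. A minimal sketch of one way to list only the columns that have gaps, worst first, is shown below. The helper name `missing_summary` and the toy values are assumptions made for illustration; in the notebook the function would be called on `df`.

```python
import numpy as np
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column, worst first."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(2),
    })
    # Keep only columns that actually have missing values
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Toy frame with made-up values, standing in for `df`
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, np.nan],
    "Alley": [None, None, None, "Pave"],
    "LotArea": [8450, 9600, 11250, 9550],
})

print(missing_summary(toy))
```

Ranking by missing count makes it easier to decide later which columns to impute and which may be too sparse to keep.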

   - [Data Types](#data-types) 

In [25]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1460 non-null   int64  
 1   MSSubClass     1460 non-null   int64  
 2   MSZoning       1460 non-null   object 
 3   LotFrontage    1201 non-null   float64
 4   LotArea        1460 non-null   int64  
 5   Street         1460 non-null   object 
 6   Alley          91 non-null     object 
 7   LotShape       1460 non-null   object 
 8   LandContour    1460 non-null   object 
 9   Utilities      1460 non-null   object 
 10  LotConfig      1460 non-null   object 
 11  LandSlope      1460 non-null   object 
 12  Neighborhood   1460 non-null   object 
 13  Condition1     1460 non-null   object 
 14  Condition2     1460 non-null   object 
 15  BldgType       1460 non-null   object 
 16  HouseStyle     1460 non-null   object 
 17  OverallQual    1460 non-null   int64  
 18  OverallC

   - [Preview Data](#preview-data)

In [26]:
df.head()

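| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |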


---

### 3. [Data Cleaning](#data-cleaning)

   - [Drop Duplicates](#drop-duplicates)

In [None]:
# Remove fully duplicated rows, keeping the first occurrence
df = df.drop_duplicates()

   - [Standardize Text and Formats](#standardize-text-and-formats)

   - [Convert Data Types](#convert-data-types)
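
Some columns in this dataset hold category codes rather than quantities; `MSSubClass`, for example, is stored as `int64` but identifies a dwelling type. A minimal sketch of converting such columns to the pandas `category` dtype follows. The choice of columns and the toy values are assumptions for illustration, not a decision already made in this project.

```python
import pandas as pd

# Toy rows with made-up values, standing in for `df`
toy = pd.DataFrame({
    "MSSubClass": [60, 20, 60, 70],
    "MSZoning": ["RL", "RL", "RM", "RL"],
    "LotArea": [8450, 9600, 11250, 9550],
})

# Assumption: these columns hold category codes or labels, not quantities
categorical_cols = ["MSSubClass", "MSZoning"]

# astype returns a new frame; the original is left unchanged
converted = toy.astype({col: "category" for col in categorical_cols})

print(converted.dtypes)
```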

   - [Filter Irrelevant Records](#filter-irrelevant-records)


   - [Handle Inconsistent Values](#handle-inconsistent-values)  

---

### 4. [Exploratory Data Analysis (EDA)](#exploratory-data-analysis-eda)
   - [Univariate Analysis](#univariate-analysis)  
   - [Bivariate & Multivariate Analysis](#bivariate--multivariate-analysis)  
   - [Distribution of Variables](#distribution-of-variables)  
   - [Correlation Analysis](#correlation-analysis)  
   - [Outlier Detection](#outlier-detection)  
   - [Initial Insights](#initial-insights)  
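
Two of the steps listed above, correlation analysis and outlier detection, can be sketched as follows. This is only an illustration on made-up rows: the helper `iqr_outliers`, the 1.5 × IQR rule, and the use of `GrLivArea` as the example feature are assumptions, not choices already made in this project.

```python
import pandas as pd

# Toy rows with made-up values, standing in for `df`
toy = pd.DataFrame({
    "GrLivArea": [1000, 1200, 1500, 1800, 2000, 6000],
    "OverallQual": [4, 5, 6, 7, 8, 10],
    "SalePrice": [110000, 130000, 160000, 200000, 230000, 250000],
})

# Correlation of each numeric feature with the target, strongest first
correlations = (
    toy.corr(numeric_only=True)["SalePrice"]
    .drop("SalePrice")
    .sort_values(ascending=False)
)


def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


outlier_mask = iqr_outliers(toy["GrLivArea"])

print(correlations)
print(toy[outlier_mask])
```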

---

### 5. [Feature Engineering](#feature-engineering)
   - [Handling Missing Data](#handling-missing-data)  
   - [Encoding Categorical Variables](#encoding-categorical-variables)  
   - [Creating New Features](#creating-new-features)  
   - [Feature Transformation (Scaling, Normalization)](#feature-transformation-scaling-normalization)  
   - [Feature Selection](#feature-selection)  
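
The missing-data, encoding, and scaling steps listed above could be combined into a single preprocessing object. The sketch below uses `ColumnTransformer`, `Pipeline`, and `SimpleImputer`, which are not among the notebook's imports and are introduced here as assumptions, together with the `StandardScaler` and `OneHotEncoder` that are. The column lists, imputation strategies, and toy values are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy rows with made-up values, standing in for the feature columns of `df`
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, 60.0],
    "LotArea": [8450, 9600, 11250, 9550],
    "MSZoning": ["RL", "RM", "RL", np.nan],
})

numeric_cols = ["LotFrontage", "LotArea"]
categorical_cols = ["MSZoning"]

# Numeric columns: fill gaps with the median, then standardize
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Categorical columns: fill gaps with the most frequent label, then one-hot encode
categorical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

preprocessor = ColumnTransformer([
    ("num", numeric_steps, numeric_cols),
    ("cat", categorical_steps, categorical_cols),
])

features = preprocessor.fit_transform(toy)
print(features.shape)
```

Fitting the preprocessor only on the training split, and reusing it to transform validation data, keeps statistics such as medians and means from leaking across the split.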


---

### 6. [Modeling / Statistical Analysis](#modeling--statistical-analysis)
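
A minimal sketch of the modeling step, using the `train_test_split` and `RandomForestRegressor` imports from the first cell. Synthetic data from `make_regression` stands in for the prepared features and `SalePrice`; the split size, `n_estimators`, and `random_state` values are assumptions, not tuned settings.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for the prepared features and SalePrice
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Hold out 20% of the rows for validation
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

predictions = model.predict(X_valid)
print(predictions[:5])
```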

---

### 7. [Evaluation & Interpretation](#evaluation--interpretation)
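
A minimal sketch of the evaluation step, using the `mean_squared_error`, `r2_score`, and `cross_val_score` imports from the first cell. The data is synthetic and the model is a plain `LinearRegression`, both chosen only to keep the example self-contained; the 5-fold setting is likewise an assumption.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic regression data standing in for the prepared features and SalePrice
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
predictions = model.predict(X_valid)

# RMSE is expressed in the same units as the target
rmse = np.sqrt(mean_squared_error(y_valid, predictions))
r2 = r2_score(y_valid, predictions)

# 5-fold cross-validation; scikit-learn negates error metrics so higher is better
cv_scores = cross_val_score(
    model, X, y, cv=5, scoring="neg_root_mean_squared_error"
)
cv_rmse = -cv_scores.mean()

print(f"RMSE: {rmse:.2f}  R2: {r2:.3f}  CV RMSE: {cv_rmse:.2f}")
```

Comparing the hold-out RMSE with the cross-validated RMSE gives a rough check on whether a single split is giving an overly optimistic or pessimistic picture.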

---

### 8. [Conclusions](#conclusions)

---

### 9. [Future Work](#future-work)

---

### 10. [References](#references)
