# Dealing with Categorical Variables - Lab

## Introduction

In this lab, you'll explore the Ames Housing dataset and identify numeric and categorical variables. Then you'll transform some categorical data and use it in a multiple regression model.

## Objectives

You will be able to:

* Determine whether variables are categorical or numeric
* Use one-hot encoding to create dummy variables

## Step 1: Load the Ames Housing Dataset

Import `pandas`, and use it to load the file `ames.csv` into a dataframe called `ames`. Passing the argument `index_col=0` sets the "Id" feature as the index.

In [99]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import statsmodels.api as sm

# Your code here - load the dataset
ames = pd.read_csv('ames.csv', index_col=0)

Visually inspect `ames` (it's ok if you can't see all of the columns).

In [100]:
# Your code here
ames.head()


| Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | LotConfig | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | Inside | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
| 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | FR2 | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
| 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | Inside | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
| 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | Corner | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | FR2 | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |


Go ahead and drop all **columns** with missing data, to simplify the problem. Remember that you can use the `dropna` method ([documentation here](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html)).
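As a minimal sketch of how `dropna` behaves with `axis=1`, the example below uses a small made-up dataframe (the name `toy` and its values are assumptions for illustration, not the Ames data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame for illustration only (not the Ames data)
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "Alley": [np.nan, "Grvl", np.nan],
    "Street": ["Pave", "Pave", "Pave"],
})

# axis=1 drops every column that contains at least one missing value;
# the default (axis=0) would drop rows instead
toy_complete = toy.dropna(axis=1)
print(toy_complete.columns.tolist())
```

Only `Alley` contains missing values here, so it is the only column removed.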

In [101]:
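na_columns = ['Alley','PoolQC','Fence','MiscFeature']

# Your code here - drop columns with missing data
# Note: this drops only the four columns listed above;
# ames.dropna(axis=1) would drop every column that contains a missing value
ames = ames.drop(columns=na_columns)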

## Step 2: Identify Numeric and Categorical Variables

The file `data_description.txt`, located in this repository, has a full description of all variables.

Using this file as well as `pandas` techniques, identify the following predictors:

1. A **continuous numeric** predictor
2. A **discrete numeric** predictor
3. A **string categorical** predictor
4. A **discrete categorical** predictor

(Note that `SalePrice` is the target variable and should not be selected as a predictor.)

For each of these predictors, visualize the relationship between the predictor and `SalePrice` using an appropriate plot.

Finding these will take some digging -- don't be discouraged if they're not immediately obvious. The Ames Housing dataset is a lot more complex than the Auto MPG dataset. There is also no single right answer here.
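One possible way to start the search with `pandas` is to combine the column data types with the number of unique values per column. The sketch below uses a made-up dataframe (`toy` and its values are assumptions; only the column names are borrowed from the dataset) and is a starting heuristic, not a substitute for reading `data_description.txt`:

```python
import pandas as pd

# Hypothetical toy frame for illustration only
toy = pd.DataFrame({
    "GrLivArea": [1710, 1262, 1786, 1717],       # continuous numeric
    "Fireplaces": [0, 1, 1, 1],                  # discrete numeric (a count)
    "Street": ["Pave", "Pave", "Grvl", "Pave"],  # string categorical
    "MSSubClass": [60, 20, 60, 70],              # numeric codes for categories
})

# Split columns by data type
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
string_cols = toy.select_dtypes(exclude="number").columns.tolist()

# A numeric column with few unique values may be discrete numeric or a
# categorical variable stored as numbers; the data description tells which
unique_counts = toy[numeric_cols].nunique()

print(numeric_cols)
print(string_cols)
print(unique_counts)
```

Note that `MSSubClass` shows up as numeric here even though its values are category codes, which is why the data type alone is not enough.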

### Continuous Numeric Predictor

In [102]:
import matplotlib.pyplot as plt
# Your code here - continuous numeric predictor
# Scatter plot of 1stFlrSF vs SalePrice

# Create the axes first so that the scatter plot uses the requested figure size
fig, ax = plt.subplots(figsize=(15,4))
ames.plot.scatter(x='1stFlrSF', y='SalePrice', ax=ax);

### Discrete Numeric Predictor

In [72]:
# Your code here - discrete numeric predictor
# OverallQual, OverallCond, YearBuilt, YearRemodAdd, YrSold, Fireplaces, GarageYrBlt, GarageCars, MoSold

Y_dnp = ames['SalePrice']
X = ames[['OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'YrSold', 'Fireplaces', 'GarageYrBlt', 'GarageCars', 'MoSold']]

X_train, X_test, y_train, y_test = train_test_split(X, Y_dnp, test_size=0.2, random_state=43)

model = LinearRegression()

# Fit the model to the training data
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f'Mean Squared Error: {mse}')
print(f'R-squared: {r2}')

X_train_with_constant = sm.add_constant(X_train)

# Fit the model using statsmodels
ols_model = sm.OLS(y_train, X_train_with_constant).fit()

# Print the summary
print(ols_model.summary())
# OverallCond

Mean Squared Error: 3280285192.561438
R-squared: 0.654220897577289
                            OLS Regression Results                            
Dep. Variable:              SalePrice   R-squared:                       0.681
Model:                            OLS   Adj. R-squared:                  0.674
Method:                 Least Squares   F-statistic:                     108.0
Date:                Wed, 21 Aug 2024   Prob (F-statistic):          4.04e-107
Time:                        08:45:28   Log-Likelihood:                -5716.1
No. Observations:                 466   AIC:                         1.145e+04
Df Residuals:                     456   BIC:                         1.149e+04
Df Model:                           9                                         
Covariance Type:            nonrobust                                         
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------

### String Categorical Predictor

### Discrete Categorical Predictor

In [90]:
# Your code here - discrete categorical predictor
# Candidates: MSSubClass, OverallQual, OverallCond
columns_names = ['MSSubClass', 'OverallQual', 'OverallCond']
columns_names = [col for col in columns_names if col in ames.columns]

# One-hot encode only the chosen predictors, treating their numeric codes as categories
# (dtype=float keeps the dummy columns numeric for statsmodels)
X = pd.get_dummies(ames[columns_names].astype('category'), drop_first=True, dtype=float)
y = ames['SalePrice']

# Check for any non-numeric data types in X and y
print("X Data Types:")
print(X.dtypes)

print("y Data Type:")
print(y.dtypes)

# Add a constant to the model (intercept)
X = sm.add_constant(X)

# Fit the regression model
model = sm.OLS(y, X)
results = model.fit()

# Print the summary of the regression model
print(results.summary())

In [88]:
# List of categorical columns
columns_names = ['MSZoning', 'Street', 'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'ExterQual', 'ExterCond', 'Foundation','BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'Heating', 'HeatingQC', 'CentralAir', 'Electrical', 'KitchenQual', 'Functional', 'FireplaceQu', 'GarageType', 'GarageFinish', 'SaleType', 'SaleCondition']
ames = ames.dropna()
# Ensure all columns in columns_names exist in ames
columns_names = [col for col in columns_names if col in ames.columns]
# Also encode any non-numeric columns that the list above leaves out;
# otherwise their string values cannot be used in the regression
columns_names += [col for col in ames.select_dtypes(exclude='number').columns if col not in columns_names]

# One-hot encode categorical variables (dtype=float keeps the dummy columns numeric)
ames_encoded = pd.get_dummies(ames, columns=columns_names, drop_first=True, dtype=float)

# Define predictor variables (X) and target variable (y)
X = ames_encoded.drop('SalePrice', axis=1)
y = ames_encoded['SalePrice']

# Check for any non-numeric data types in X and y
print("X Data Types:")
print(X.dtypes)

print("y Data Type:")
print(y.dtypes)

# Add a constant to the model (intercept)
X = sm.add_constant(X)

# Fit the regression model
model = sm.OLS(y, X)
results = model.fit()

# Print the summary of the regression model
print(results.summary())

## Step 3: Build a Multiple Regression Model with Your Chosen Predictors

Of the 4 predictors you identified, choose the 3 that show the clearest relationship with `SalePrice` in your plots, and include them in your model.

Before fitting the model, make sure that you one-hot encode your categorical predictor(s), regardless of whether the current data type is a string or a number.
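A minimal sketch of one-hot encoding a categorical predictor that is stored as numbers is shown below. The dataframe `toy` and its values are made up for illustration; listing the column under `columns=` makes `pd.get_dummies` encode it even though its data type is numeric.

```python
import pandas as pd

# Hypothetical toy frame for illustration only
toy = pd.DataFrame({
    "MSSubClass": [20, 60, 70, 60],       # category codes stored as integers
    "GrLivArea": [1262, 1710, 1717, 1786],
})

# drop_first=True drops one category per encoded column (the reference category)
# to avoid perfect multicollinearity with the intercept
X = pd.get_dummies(toy, columns=["MSSubClass"], drop_first=True, dtype=int)
print(X.columns.tolist())
```

Here the category `20` has no column of its own after encoding, so it is the dropped reference category.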

In [25]:
# Your code here - prepare X and y, including one-hot encoding


In [28]:
# Your answer here - which category or categories were dropped?


In [30]:
# Your code here - build a regression model and display results


## Step 4: Create Partial Regression Plots for Features

For each feature of the regression above (including the dummy features), plot the partial regression.

In [32]:
# Your code here - create partial regression plots


## Step 5: Calculate an Error-Based Metric

In addition to the adjusted R-squared shown in the model summary, calculate either the mean absolute error (MAE) or the root mean squared error (RMSE) for this model.
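Both metrics can be computed directly from the residuals. The sketch below uses made-up true and predicted values (`y_true` and `y_pred` are assumptions for illustration, not model output):

```python
import numpy as np

# Hypothetical values for illustration only
y_true = np.array([200000.0, 150000.0, 300000.0, 250000.0])
y_pred = np.array([210000.0, 140000.0, 290000.0, 270000.0])

residuals = y_true - y_pred

# MAE: average size of the errors, in the units of the target
mae = np.mean(np.abs(residuals))

# RMSE: square root of the average squared error; penalizes large errors more
rmse = np.sqrt(np.mean(residuals ** 2))

print(f"MAE:  {mae:.2f}")
print(f"RMSE: {rmse:.2f}")
```

With a fitted `statsmodels` results object, the residuals are available as `results.resid`, so the same two formulas apply without computing predictions separately.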

In [34]:
# Your code here - calculate an error-based metric


## Step 6: Summarize Findings

Between the model results, partial regression plots, and error-based metric, what does this model tell you? What would your next steps be to improve the model?

In [36]:
# Your answer here


## Level Up (Optional)

Try transforming X using scikit-learn _and_ fitting a scikit-learn linear regression as well. If there are any differences in the result, investigate them.
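A minimal sketch of the scikit-learn approach is shown below, assuming a `ColumnTransformer` that one-hot encodes one column and passes the rest through. The dataframe `toy`, its values, and the choice of columns are made up for illustration:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data for illustration only
toy = pd.DataFrame({
    "MSSubClass": [20, 60, 70, 60, 20, 70],
    "GrLivArea": [1262, 1710, 1717, 1786, 1100, 1600],
    "SalePrice": [181500, 208500, 140000, 223500, 150000, 160000],
})
X = toy[["MSSubClass", "GrLivArea"]]
y = toy["SalePrice"]

# One-hot encode MSSubClass (dropping the first category) and keep GrLivArea as-is;
# sparse_threshold=0 makes the transformer return a dense array
encoder = ColumnTransformer(
    [("onehot", OneHotEncoder(drop="first"), ["MSSubClass"])],
    remainder="passthrough",
    sparse_threshold=0,
)

# Chain the encoding and the regression so both are fit together
model = make_pipeline(encoder, LinearRegression())
model.fit(X, y)

linreg = model.named_steps["linearregression"]
print(linreg.intercept_, linreg.coef_)
print(model.score(X, y))  # R-squared on the training data
```

`LinearRegression` fits an intercept by default, so no constant column is added by hand, unlike `sm.add_constant` in `statsmodels`.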

In [38]:
# Your code here

## Summary

In this lab, you applied your knowledge of categorical variables to the Ames Housing dataset. Specifically, you practiced distinguishing numeric and categorical data. You then created dummy variables using one-hot encoding in order to build a multiple regression model.