## MMA 867 Team Assignment: Predicting Housing Prices

Team Istanbul

In [None]:
import os

import pandas as pd
import numpy as np

import statsmodels.imputation.mice as mice
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.stats.diagnostic import het_breuschpagan

from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn import linear_model
from sklearn.metrics import mean_absolute_error, mean_squared_error

from patsy import dmatrices 
import scipy.stats as stats

import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# Load the train and test CSV files into dataframes
file_path_test  = "test.csv"
file_path_train = "train.csv"

df_test  = pd.read_csv(file_path_test)
df_train = pd.read_csv(file_path_train)

In [None]:
# Concatenate the train and test dataframes into one
df_new_name = pd.concat([df_train, df_test], ignore_index=True)

### Guidelines

<h6 style="color: blue;">  
<ul>
  <li>The train dataset contains the sale price (SalePrice)</li>
  <li>The test dataset doesn't contain the sale price</li>
</ul>
</h6>

1. Combine both datasets *(hint: use pd.concat)*
2. Run all steps as they are in your version of the code, but update the dataframe name to match the combined dataframe *(hint: use the Replace All feature in VS Code)*
3. Right before running the model, separate the test and train dataset's data *(hint: df_train contains SalePrice, but df_test doesn't. There are other ways too)*
4. Run the model for the **entire** train dataset
5. Predict the sale price only for test dataset
6. Save the predicted sale prices as a csv file (with only the Id and the SalePrice):
```python
# test: the test rows separated in step 3; predictions: the model output for those rows
submission = pd.DataFrame({
    "Id": test["Id"],
    "SalePrice": predictions
})
submission.to_csv("submission.csv", index=False)
```

7. Upload your code file & csv file (with predicted output) to Kaggle and share the rank with the team.
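
Steps 1 and 3 to 6 above can be sketched end to end on a toy dataset. This is a minimal sketch rather than the team's actual pipeline: the toy data, the single `GrLivArea` feature, the names `df_all`, `train`, and `test`, and the plain `LinearRegression` stand-in model are all assumptions made for illustration.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy stand-ins for train.csv and test.csv (made-up values, illustration only)
toy_train = pd.DataFrame({
    "Id": [1, 2, 3, 4],
    "GrLivArea": [1000, 1500, 2000, 2500],
    "SalePrice": [100000.0, 150000.0, 200000.0, 250000.0],
})
toy_test = pd.DataFrame({"Id": [5, 6], "GrLivArea": [1200, 1800]})

# Step 1: combine both datasets; the test rows get NaN in SalePrice
df_all = pd.concat([toy_train, toy_test], ignore_index=True)

# (shared cleaning and feature engineering on df_all would run here)

# Step 3: separate again, using the missing SalePrice as the marker
train = df_all[df_all["SalePrice"].notna()]
test = df_all[df_all["SalePrice"].isna()]

# Step 4: fit on the entire train portion (stand-in model, assumed)
features = ["GrLivArea"]
model = LinearRegression().fit(train[features], train["SalePrice"])

# Step 5: predict only for the test portion
predictions = model.predict(test[features])

# Step 6: save only Id and SalePrice
submission = pd.DataFrame({
    "Id": test["Id"].to_numpy(),
    "SalePrice": predictions,
})
submission.to_csv("submission.csv", index=False)
```

Splitting on a missing `SalePrice` only works if the target is never imputed while the datasets are combined; if it might be, adding an explicit flag column before the concat is a safer marker.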

#### Reading your rank

| Rank         | Outcome       | Action                                                                 |
|:--------------|:---------------|:------------------------------------------------------------------------|
| <span style="color:green">Top 20%</span>     | Excellent      | Finalize and submit this version of the code for the assignment.       |
| <span style="color:#CCCC00">21–30%</span>   | Good           | Review feature selection and tuning to further optimize performance.   |
| <span style="color:orange">31–40%</span>      | Okay           | Re-evaluate model parameters and feature engineering for improvements. |
| <span style="color:red">Beyond 40%</span>      | Unsatisfactory | Revisit EDA, improve feature engineering, and retrain the model.       |
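
To place yourself in the table, convert your leaderboard position into a percentile. The helper below is a hypothetical sketch: the function names are made up, it reads the first row as "top 20%" and the last row as anything past 40%, and it assumes values that fall between bands (for example 20.5%) belong to the next band down.

```python
def rank_percentile(position: int, total_teams: int) -> float:
    """Leaderboard position as a percentage of all teams (lower is better)."""
    if total_teams <= 0 or not 1 <= position <= total_teams:
        raise ValueError("position must be between 1 and total_teams")
    return 100.0 * position / total_teams


def rank_outcome(position: int, total_teams: int) -> str:
    """Map a leaderboard position to the outcome bands in the table above."""
    pct = rank_percentile(position, total_teams)
    if pct <= 20:
        return "Excellent"
    if pct <= 30:
        return "Good"
    if pct <= 40:
        return "Okay"
    return "Unsatisfactory"


# Example with made-up numbers: position 1125 out of 4500 teams is 25%
print(rank_percentile(1125, 4500), rank_outcome(1125, 4500))
```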


#### Proceed to the next step if your rank is unsatisfactory

1. Focus on EDA and feature engineering, revisit the code, make updates as you please.  
2. Start by changing small parts of your code and check if your rank improves.  
3. Avoid making too many big changes at once; your rank might get worse.   
4. Tune your model's parameters to improve accuracy.   
5. Keep track of what changes you make and how they affect your rank.  
6. Re-submit when you're confident the changes improved your results.  
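
One lightweight way to keep track of changes and their effect on rank (step 5) is a small log with one row per submission. This is only a sketch: the `record` helper, the column names, and every change description and rank value below are made up for illustration.

```python
import pandas as pd

log = []  # one dict per Kaggle submission


def record(change: str, rank_pct: float) -> None:
    """Append one submission: what changed and the resulting rank percentile."""
    log.append({"change": change, "rank_pct": rank_pct})


# Made-up entries, for illustration only
record("baseline (unchanged code)", 42.0)
record("log-transform SalePrice", 35.5)
record("add interaction terms", 37.0)

history = pd.DataFrame(log)
# A negative delta means the rank percentile dropped, i.e. the rank improved
history["delta"] = history["rank_pct"].diff()
history["improved"] = history["delta"] < 0

best = history.loc[history["rank_pct"].idxmin(), "change"]
print(history)
print("Best so far:", best)
```

Logging one change per row keeps the effect of each edit attributable, which is the reason for changing small parts of the code at a time.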

#### Use of AI

1. Please align any AI-generated code with the rest of the code
2. Do not use data science techniques that you cannot explain. If you cannot explain the logic behind the code (why you are using it, not just what it does), <b><span style="color:red"> PLEASE DO NOT USE IT </span></b>

### Timeline

- **Sat Apr 19, 11:59 PM** – First submission (unchanged code files)  
  _Please share your rankings with the team_

- **Wed Apr 23, 10:00 PM** – Final submission (iterations of code update)  
  _Please share your rankings with the team_

- **Thu Apr 24, 9:00 PM** – Lead to Second

- **Fri Apr 25, 6:00 PM** – Second to Team

- **Fri Apr 25, 10:00 PM** – Submission



### Assignments

There are no subteams for this assignment. It is all individual work, as we have already done most of the assignment. This will help you develop skills and experiment with different techniques.

<h6 style="color: blue;">First Submission (unchanged code files)</h6>

The lead and second from 803 have not been assigned this task so that they have additional time to focus on 803.

| Version | Member  |
|:---------|:---------|
| V1      | Omar    |
| V2      | Sudip   |
| V3      | Lillian |
| V4      | Thorn   |

Please upload your updated versions to the appropriate <a href="https://queensuca.sharepoint.com/:f:/r/teams/GROUP-MMA2026-Istanbul/Shared%20Documents/MMA%20867%20-%20Predictive%20Analytics/Team%20Assignment/First%20Submission?csf=1&web=1&e=MboMj4">sharepoint directory</a>, using naming convention ```HousingPrices_Vx_Rx``` where Vx is the version and Rx is the ranking

<h6 style="color: blue;">Version Assignment for Iterations</h6>

| Member  | Version Assigned |
|:---------|:------------------|
| Jill    | V1               |
| Omar    | V2               |
| Lavanya | V3               |
| Rabab   | V4               |
| Sudip   | V1               |
| Thorn   | V2               |
| Lillian | V3               |

- If your ranking improves, please upload your file to the appropriate <a href="https://queensuca.sharepoint.com/:f:/r/teams/GROUP-MMA2026-Istanbul/Shared%20Documents/MMA%20867%20-%20Predictive%20Analytics/Team%20Assignment/Ranking%20Improvements?csf=1&web=1&e=cGYlwt">sharepoint directory</a>, using naming convention ```HousingPrices_XX_Rx``` where XX are your initials and Rx is the ranking.
- If, after a couple of tries, you want to try a different version, please contact the lead and the second so we can track the versions and edits.
- As soon as anyone reaches a top 20% ranking, we stop and use that code file as the final submission.

<h1 style="color:Green"> Have fun! 🎉🎉 <br> Go ham on different techniques you always wanted to try!!</h1>

When else will we get an opportunity where we've already done an assignment and have so much time available to experiment?

### Next Steps

1. For the next meeting, have datasets ready that you would like to explore (please use Kaggle as it makes things easier)
2. You can reuse the topics that were shared in MMA 860
3. Please add your topics under your names, along with the Kaggle dataset link, in this <a href="https://queensuca.sharepoint.com/:w:/r/teams/GROUP-MMA2026-Istanbul/_layouts/15/Doc.aspx?sourcedoc=%7Bb22a90d7-ee6d-4776-b242-3cd0f886afe6%7D&action=editnew">Word document</a>
4. We will hold a poll in the next meeting to pick the final topic