## Length of the code {-}
No restriction

**Delete this section from the report when using this template.**

In [13]:
# loading libraries
import pandas as pd
import numpy as np

In [2]:
# loading data
games = pd.read_csv("data_raw/game.csv")
team_info = pd.read_csv("data_raw/team_info.csv")

## Data quality check / cleaning / preparation 

Put code with comments. The comments should explain the code such that it can be easily understood. You may put text *(in a markdown cell)* before a large chunk of code to explain the overall purpose of the code, if it is not intuitive. **Put the name of the person / persons who contributed to each code chunk / set of code chunks.** An example is given below.

### Data quality check
*By Elton John*

The code below visualizes the distribution of all the variables in the dataset, and their association with the response.

In [7]:
#...Distribution of continuous variables...#

In [8]:
#...Distribution of categorical variables...#

In [9]:
#...Association of the response with the predictors...#
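
As a numeric companion to the plots, the same three checks can be sketched in table form. This is a minimal sketch on made-up rows: the data frame `toy`, its column names, and the choice of `home_win` as the response are assumptions for illustration, not part of the original analysis.

```python
import pandas as pd

# Made-up rows; the column names are hypothetical
toy = pd.DataFrame({
    "home_goals": [3, 1, 4, 2, 5, 0],
    "away_goals": [2, 2, 1, 2, 3, 1],
    "type": ["R", "R", "P", "R", "R", "P"],
    "home_win": [1, 0, 1, 0, 1, 0],
})

continuous = ["home_goals", "away_goals"]

# Distribution of continuous variables: count, mean, spread and quartiles
continuous_summary = toy[continuous].describe()

# Distribution of a categorical variable: share of rows in each level
type_shares = toy["type"].value_counts(normalize=True)

# Association with the response: mean of each predictor by response level
by_response = toy.groupby("home_win")[continuous].mean()

print(continuous_summary)
print(type_shares)
print(by_response)
```

Tables like these do not replace the plots, but they make it easy to spot missing levels or implausible ranges before plotting.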

### Data cleaning
*By Tess Wagner*

From the data quality check we realized that:

1. We needed to keep only the games from the seasons we wanted to use to build our model: the 2007-08 season through the 2017-18 season (eleven seasons, inclusive).
2. We can drop the columns in `game.csv` and `team_info.csv` that we know we will not use (`venue_link`, `franchiseId`, `abbreviation`, `link`).
3. To get a full team name, we combined the variables `teamName` and `shortName` (and fixed the NY team names).
4. To make it easier to identify which team played which games, we replaced `team_id` (a numerical value) in the game dataset with `team_name` from the team info dataset.
5. We filtered the games dataset down to regular season games so that playoff games do not factor into our model.
6. We dropped `team_id` from the team names dataset, as it was no longer needed once the ids had been replaced by names.

The code below implements the above cleaning.

In [3]:
games = pd.read_csv("data_raw/game.csv")
team_info = pd.read_csv("data_raw/team_info.csv")

In [4]:
# Keep only the 2007-08 through 2017-18 seasons, ordered by season
games = games[(games['season'] >= 20072008) & (games['season'] <= 20172018)].sort_values('season')

In [5]:
# Drop the columns that are not used in the analysis
games.drop('venue_link', axis = 1, inplace = True)
team_info.drop(['franchiseId', 'abbreviation', 'link'], axis = 1, inplace = True)

In [9]:
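# Build the full team name from the short name and the team name
team_info['team_name'] = team_info['shortName'] + ' ' + team_info['teamName']
# The two New York teams come out as 'NY Rangers Rangers' / 'NY Islanders Islanders', so fix them by hand
team_info.replace(['NY Rangers Rangers', 'NY Islanders Islanders'], ['New York Rangers', 'New York Islanders'], inplace = True)
# The original name columns are no longer needed
team_info.drop(['shortName', 'teamName'], axis = 1, inplace = True)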

In [11]:
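# Map each numeric team_id to its full team name
id_to_name = dict(zip(team_info['team_id'], team_info['team_name']))

# Replace the numeric ids in both team columns with the team names.
# Assigning the result back avoids calling an in-place replace on a column accessed from the frame,
# which newer pandas versions warn about and may not propagate to `games`.
games['away_team_id'] = games['away_team_id'].replace(id_to_name)
games['home_team_id'] = games['home_team_id'].replace(id_to_name)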

In [18]:
# Keep regular season games only (type 'R'), leaving playoff games out
regular_season_games = games[games['type'] == 'R']

In [None]:
# Drop team_id, which is no longer needed now that the ids have been replaced by names
team_names = team_info.drop('team_id', axis = 1)
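
The cleaning steps above can be sanity-checked on a few made-up rows. The sketch below is only an illustration: the toy data frames and the helper name `clean_games` are assumptions, and the rows mirror the column names used above without reproducing the real contents of `game.csv` or `team_info.csv`.

```python
import pandas as pd

# Toy stand-ins for the raw tables (made-up rows, illustration only)
toy_games = pd.DataFrame({
    "season": [20062007, 20072008, 20172018, 20182019],
    "type": ["R", "R", "P", "R"],
    "away_team_id": [1, 3, 1, 2],
    "home_team_id": [2, 1, 3, 3],
})
toy_team_info = pd.DataFrame({
    "team_id": [1, 2, 3],
    "shortName": ["New Jersey", "NY Islanders", "NY Rangers"],
    "teamName": ["Devils", "Islanders", "Rangers"],
})


def clean_games(games, team_info):
    """Hypothetical helper that repeats the cleaning steps on copies of the inputs."""
    # Keep only the 2007-08 through 2017-18 seasons (both ends inclusive)
    games = games[games["season"].between(20072008, 20172018)].copy()

    # Build full team names and fix the two New York teams
    full_name = (team_info["shortName"] + " " + team_info["teamName"]).replace({
        "NY Rangers Rangers": "New York Rangers",
        "NY Islanders Islanders": "New York Islanders",
    })
    id_to_name = dict(zip(team_info["team_id"], full_name))

    # Swap numeric ids for team names in both team columns
    for col in ["away_team_id", "home_team_id"]:
        games[col] = games[col].map(id_to_name)

    # Keep regular season games only
    return games[games["type"] == "R"]


cleaned = clean_games(toy_games, toy_team_info)

# Every remaining game is a regular season game inside the season window
assert (cleaned["type"] == "R").all()
assert cleaned["season"].between(20072008, 20172018).all()
# No team id was left unmapped
assert cleaned[["away_team_id", "home_team_id"]].notna().all().all()
print(cleaned)
```

The same three assertions can be run against `regular_season_games` after the real cleaning cells. Note that the sketch uses `map`, which turns unmapped ids into missing values so they are easy to detect, whereas `replace` leaves them unchanged.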

### Data preparation
*By Sankaranarayanan Balasubramanian and Chun-Li*

The following data preparation steps helped us to prepare our data for implementing various modeling / validation techniques:

1. Since we need to predict house price, we derived some new predictors *(from existing predictors)* that intuitively seem helpful for predicting house price.

2. We have shuffled the dataset to prepare it for K-fold cross validation.

3. We have created a standardized version of the dataset, as we will use it to develop Lasso / Ridge regression models.

In [3]:
######---------------Creating new predictors----------------#########

#Creating number of bedrooms per unit floor area

#Creating ratio of bathrooms to bedrooms

#Creating ratio of carpet area to floor area

In [None]:
######-----------Shuffling the dataset for K-fold------------#########

In [None]:
######-----Standardizing the dataset for Lasso / Ridge-------#########
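
The three placeholder cells above could be filled in along the following lines. This is a minimal sketch under stated assumptions: the data frame `houses`, all of its column names, and the `random_state` value are made up, and the standardization is a plain pandas z-score (population standard deviation) rather than any particular library's scaler.

```python
import pandas as pd

# Made-up housing rows; all column names here are hypothetical
houses = pd.DataFrame({
    "price": [250000, 310000, 180000, 420000, 275000, 365000],
    "bedrooms": [3, 4, 2, 5, 3, 4],
    "bathrooms": [2, 3, 1, 4, 2, 2],
    "floor_area": [1500, 2000, 900, 2800, 1600, 2100],
    "carpet_area": [1200, 1650, 700, 2300, 1250, 1700],
})

# Derived predictors built from existing predictors
houses["bedrooms_per_area"] = houses["bedrooms"] / houses["floor_area"]
houses["bath_bed_ratio"] = houses["bathrooms"] / houses["bedrooms"]
houses["carpet_floor_ratio"] = houses["carpet_area"] / houses["floor_area"]

# Shuffle the rows reproducibly so that folds are not tied to the original row order
shuffled = houses.sample(frac=1, random_state=1).reset_index(drop=True)

# Standardize the predictors only (mean 0, standard deviation 1);
# the response stays on its original scale
predictors = shuffled.columns.drop("price")
standardized = shuffled.copy()
standardized[predictors] = (
    shuffled[predictors] - shuffled[predictors].mean()
) / shuffled[predictors].std(ddof=0)

print(standardized.round(2))
```

Standardizing matters for Lasso / Ridge because the penalty acts on coefficient size, which depends on the scale of each predictor. When cross validating, the means and standard deviations would normally be computed on the training folds only.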

## Exploratory data analysis

Put code with comments. The comments should explain the code such that it can be easily understood. You may put text *(in a markdown cell)* before a large chunk of code to explain the overall purpose of the code, if it is not intuitive. **Put the name of the person / persons who contributed to each code chunk / set of code chunks.**

## Developing the model

Put code with comments. The comments should explain the code such that it can be easily understood. You may put text *(in a markdown cell)* before a large chunk of code to explain the overall purpose of the code, if it is not intuitive. **Put the name of the person / persons who contributed to each code chunk / set of code chunks.**

### Code fitting the final model

Put the code(s) that fit the final model(s) in separate cell(s), i.e., the code with the `.ols()` or `.logit()` functions.
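
A minimal sketch of such a cell, assuming the `statsmodels` formula API (`statsmodels.formula.api`), which provides `ols()` and `logit()`. The data frame `train`, its columns, and the formula are made up for illustration.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Made-up data in which y is exactly 1 + 2*x, so the fitted coefficients are easy to check
train = pd.DataFrame({"x": [0.0, 1.0, 2.0, 3.0, 4.0]})
train["y"] = 1.0 + 2.0 * train["x"]

# Fit the final model in its own cell so that it is easy to locate
final_model = smf.ols(formula="y ~ x", data=train).fit()

print(final_model.params)
```

For a binary response, `smf.logit()` takes the same `formula` and `data` arguments.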

## Conclusions and Recommendations to stakeholder(s)

You may or may not have code to put in this section. Delete this section if it is irrelevant.