# Data Scientist Candidate Problem Set
 
## Overview

This challenge is broken up into 3 Parts. This notebook will be the final document that you will submit once complete. You should be able to complete the challenge in 3 hours or less. The goals of each part are

- **Part 1**: To assess your technical capabilities regarding basic data manipulation.
- **Part 2**: To assess your technical capabilities for applying standard machine learning techniques to a simple data set. Keep in mind that a reasonable, working model is better than a novel solution.
- **Part 3**: To assess your approach to problem solving and relating machine learning to real world contexts.

In order to understand your thinking processes, please elaborate on your methods using concise markdown and/or in-code comments; however, avoid commenting on what is self-evident from the code. Examples of the types of commentary we are looking for are

- Assumptions made about the data
- Insights observed from the data set during exploration
- Anything you might have done differently or added with more time

Add additional cells as needed to arrive at your solutions. While Question 1 should be answered using [pandas](https://pandas.pydata.org/), you can use any package you wish for Question 2 - just be sure to provide a `requirements.txt` file with the versions of any additional libraries needed to run the notebook.

**Note**: The notebook should be capable of being run end to end, without error, assuming that the files `township_train.csv` and `township_test.csv` are in the same directory and `pip install -r requirements.txt` has been run.


## Data Background:
Attached to this challenge are two CSVs: 

1. `township_train.csv`: Historical data on the properties of townships and the number of townhomes within them

1. `township_test.csv`: A subset of the "train" data set used for predictions only in question **2C**

##### AJS93 Question: should test be a subset of train? Shouldn't it be a holdout?

Below is a data dictionary of the columns:
- `county_index`: anonymized ID corresponding to a particular county
- `township_index`: anonymized ID corresponding to a particular township
- `median_income`: the median income of the residents in the township
- `count_population`: the total number of residents in the township
- `count_large_companies`: the total number of companies located in the township with 200 employees or more
- `count_parks`: the total number of public parks in the township
- `count_traffic_lights`: the total number of intersections with traffic lights in the township
- `municipal_budget`: the quarterly budget of the township
- `count_breweries`: the total number of breweries in the township
- `schools_per_10000`: the ratio of the number of schools per 10,000 residents in the township
- `municipality_type`: a holistic classification of the township
- `count_townhomes`: the number of townhomes in the township

**Note:** Townhomes reside in Townships, and Townships reside in Counties

**Note**:
Assumes this has been run:

$ `pip install -r requirements.txt`

# Imports

In [3]:
import pandas as pd

# View data

In [2]:
#Read CSV file using pandas
#source: https://towardsdatascience.com/how-to-read-csv-file-using-pandas-ab1f5e7e7b58
pd.read_csv('township_train.csv')

Unnamed: 0,county_index,township_index,median_income,count_population,count_large_companies,count_parks,count_traffic_lights,municipal_budget,count_breweries,schools_per_10000,municipality_type,count_townhomes
0,7,0,123449.05,37785.0,9.0,18.0,33.0,89927.99,9.0,1.587932,urban,56.0
1,5,1,115068.28,96363.0,8.0,20.0,41.0,38657.75,4.0,0.415097,rural,38.0
2,9,2,122436.73,72146.0,5.0,8.0,37.0,40942.86,9.0,0.693039,urban,25.0
3,3,3,118320.89,52383.0,14.0,18.0,52.0,82628.78,2.0,1.145410,suburban,69.0
4,7,4,80629.23,102755.0,13.0,17.0,26.0,103646.80,2.0,0.389275,rural,54.0
...,...,...,...,...,...,...,...,...,...,...,...,...
402,4,427,119405.71,74273.0,18.0,17.0,39.0,71142.60,3.0,0.538554,rural,67.0
403,6,428,110886.78,77521.0,11.0,20.0,31.0,37959.46,5.0,0.773984,rural,49.0
404,4,429,109351.85,30398.0,10.0,15.0,28.0,56627.17,10.0,1.973814,urban,47.0
405,5,430,68525.73,40997.0,11.0,11.0,28.0,114728.23,4.0,2.195283,rural,64.0


## Question 1: Python/Pandas
Using python and `pandas` methodology and the `township_train.csv` data, create a
table that captures the following:

**A.** The average population for each county: The result should be a `DataFrame` with the columns: `['county_index', 'average_population']`.


In [3]:
df = pd.read_csv('township_train.csv')

#get number of counties
counties = set(df['county_index'])

aves = []
for c in counties:
    pop = []
    #https://stackoverflow.com/questions/16476924/how-to-iterate-over-rows-in-a-dataframe-in-pandas
    #"DataFrame.iterrows is a generator which yields both the index and row (as a Series):"
    for index, row in df.iterrows():
        #print(row['county_index'], row['count_population'])
        if row['county_index'] == c:
            #print(row['count_population'])
            pop.append(row['count_population'])
            
    aves.append(sum(pop)/len(pop))

#Save as dataframe
#https://cmdlinetips.com/2018/01/how-to-create-pandas-dataframe-from-multiple-lists/
DataFrame = pd.DataFrame({'county_index':list(counties),'average_population':aves})

DataFrame

Unnamed: 0,county_index,average_population
0,0,80625.25
1,1,71472.0
2,2,88172.333333
3,3,101270.333333
4,4,83148.52459
5,5,86905.31746
6,6,71955.62
7,7,74647.783784
8,8,74175.029412
9,9,103285.561404
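
The same per-county average can also be computed without explicit loops. A minimal sketch, assuming the column names from the data dictionary; the small frame below is made-up illustration data, not rows from `township_train.csv`:

```python
import pandas as pd

# Made-up stand-in for township_train.csv (illustration only)
toy = pd.DataFrame({
    "county_index": [0, 0, 1, 1, 1],
    "count_population": [100.0, 300.0, 50.0, 150.0, 250.0],
})

# Group rows by county, average the population, and rename the result column
avg_pop = (
    toy.groupby("county_index", as_index=False)["count_population"]
    .mean()
    .rename(columns={"count_population": "average_population"})
)
print(avg_pop)
```

One behavioral difference to be aware of: `mean()` skips missing values, whereas `sum(pop)/len(pop)` returns NaN for any county that contains one.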


**B.** For each municipality type what is the range of breweries per township: The result should be a `DataFrame` with the columns: `['municipality_type', 'min_breweries', 'max_breweries']`

In [4]:
#Repeat process from A
#this time instead of getting average, we save the max and min of each breweries list

df = pd.read_csv('township_train.csv')
munis = set(df['municipality_type'])
maxs = []
mins = []

for m in munis:
    bews = []
    for index, row in df.iterrows():
        if row['municipality_type'] == m:
            bews.append(row['count_breweries'])
            
    maxs.append(max(bews))
    mins.append(min(bews))

DataFrame = pd.DataFrame({'municipality_type':list(munis),'min_breweries':mins,'max_breweries':maxs})

DataFrame

Unnamed: 0,municipality_type,min_breweries,max_breweries
0,suburban,0.0,30.0
1,urban,0.0,30.0
2,rural,0.0,30.0
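
The min/max range per municipality type can be expressed as a single grouped aggregation. A minimal sketch on made-up data (the values below are illustrative, not taken from the CSV), using pandas named aggregation to produce the requested column names directly:

```python
import pandas as pd

# Made-up stand-in for township_train.csv (illustration only)
toy = pd.DataFrame({
    "municipality_type": ["urban", "rural", "urban", "suburban", "rural"],
    "count_breweries": [9.0, 4.0, 0.0, 2.0, 30.0],
})

# Named aggregation: output column name on the left, reducer on the right
brewery_range = (
    toy.groupby("municipality_type")["count_breweries"]
    .agg(min_breweries="min", max_breweries="max")
    .reset_index()
)
print(brewery_range)
```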


**C.** The number of schools by county: The result should be a `DataFrame` with the columns: `['county_index', 'count_schools']`

In [5]:
#Repeat process from A
#to get school count we have to do some reverse engineering
#and derive it from 'count_population' and 'schools_per_10000'

df = pd.read_csv('township_train.csv')
counties = set(df['county_index'])
schools = []

for c in counties:
    s = []
    for index, row in df.iterrows():
        if row['county_index'] == c:
            s.append(int((row['count_population']/10000)*row['schools_per_10000']))
    schools.append(sum(s))
DataFrame = pd.DataFrame({'county_index':list(counties),'count_schools':schools})

DataFrame

Unnamed: 0,county_index,count_schools
0,0,48
1,1,26
2,2,40
3,3,17
4,4,365
5,5,370
6,6,290
7,7,212
8,8,378
9,9,369
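
A caveat on the loop above: `int()` truncates, so floating-point error can drop a school. For example, the second preview row (population 96363, `schools_per_10000` 0.415097) works out to roughly 3.9999992 schools, which `int()` turns into 3. A minimal vectorized sketch that rounds instead; rounding is an assumption about the intended behavior, it runs on made-up data, and on the real file its totals may differ from the table above:

```python
import pandas as pd

# Made-up stand-in for township_train.csv (illustration only)
toy = pd.DataFrame({
    "county_index": [0, 0, 1],
    "count_population": [20000.0, 30000.0, 10000.0],
    "schools_per_10000": [1.5, 1.0, 2.0],
})

# schools = population / 10,000 * schools per 10,000, rounded to whole schools
schools = (toy["count_population"] / 10_000 * toy["schools_per_10000"]).round()

# Sum the per-township counts within each county
count_schools = (
    schools.groupby(toy["county_index"]).sum().astype(int)
    .rename("count_schools")
    .reset_index()
)
print(count_schools)
```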


## Question 2: Modeling
You are hired as a data scientist by a general contractor that constructs townhome communities. These communities are of various sizes as well as various prices and quality.

The contractor provides you with the `township_train.csv` dataset which contains the number of townhomes in each township along with various features about the township.

Your goal is to perform the following:

# imports

In [4]:
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error

## Set up the data

In [23]:
#get y (the target we want to predict) and X (the features used to predict it)
#https://medium.com/codex/how-to-set-x-and-y-in-pandas-3f38584e9bed
df = pd.read_csv('township_train.csv')
y_col = 'count_townhomes' #target column
y = df[y_col] 
X = df.drop(columns=y_col) #drop() returns a new frame, avoiding SettingWithCopyWarning when a column is added below


#handle categorical data
#https://www.kaggle.com/getting-started/27270
encode_col = 'municipality_type'
X['municipality_type_encoded'] = LabelEncoder().fit_transform(X['municipality_type'])
X = X[X.columns.drop(encode_col)]


# Split features and target into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

#further reading
#https://towardsdatascience.com/data-normalization-with-pandas-and-scikit-learn-7c1cc6ed6475

**A.** Construct a model that is capable of predicting the number of townhomes in each township.

Include any Exploratory Data Analysis (EDA), feature engineering and model training code here. Feel free to organize the contents


In [24]:
RF = RandomForestRegressor()
RF.fit(X_train, y_train);

**B.** Provide evaluations of the model's performance

In [25]:
#https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html
#https://stackoverflow.com/questions/50789508/random-forest-regression-how-do-i-analyse-its-performance-python-sklearn
#https://cnvrg.io/random-forest-regression/

pred = RF.predict(X_test)
print('MAE: ', mean_absolute_error(y_test, pred))
print('MSE: ', mean_squared_error(y_test, pred)) 

MAE:  6.2507843137254895
MSE:  66.03041372549019
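
For context, the MSE of 66.03 corresponds to an RMSE of about 8.1 townhomes. Error figures are easier to judge next to a trivial baseline. A minimal sketch of that comparison, run on synthetic data from `make_regression` rather than the township data, so its numbers say nothing about the model above; `DummyRegressor` here always predicts the training mean:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression problem (illustration only)
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=1)

models = {
    "baseline": DummyRegressor(strategy="mean").fit(X_tr, y_tr),
    "forest": RandomForestRegressor(random_state=1).fit(X_tr, y_tr),
}

results = {}
for name, model in models.items():
    pred = model.predict(X_te)
    results[name] = {
        "mae": mean_absolute_error(y_te, pred),
        "rmse": np.sqrt(mean_squared_error(y_te, pred)),  # same units as the target
        "r2": r2_score(y_te, pred),
    }
    print(name, results[name])
```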


**A&B alternative**

The same experiment with a support vector machine does not do as well.

In [26]:
#Support vector machine
from sklearn.svm import SVR
#https://medium.com/pursuitnotes/support-vector-regression-in-6-steps-with-python-c4569acd062d
#also has good info on feature cleaning
regressor = SVR(kernel='rbf')
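#NOTE: this fits on all of X and y, which include the X_test rows scored below,
#so these errors are likely optimistic; fitting on X_train, y_train would match the random forest setup
regressor.fit(X,y)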
pred = regressor.predict(X_test)
print('MAE: ', mean_absolute_error(y_test, pred))
print('MSE: ', mean_squared_error(y_test, pred))

MAE:  10.657976628866644
MSE:  181.29109887209552
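
One plausible reason the SVR lags is that an RBF kernel is sensitive to feature scale, and the columns here range from single digits to six figures. A minimal sketch of standardizing features inside a pipeline and fitting on the training split only. It runs on synthetic data with deliberately mixed scales; whether it closes the gap on the township data is untested:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic features, rescaled so columns have very different magnitudes
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=1)
X_demo = X_demo * np.array([1, 10, 100, 1000, 1, 10, 100, 1000])
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=1)

# The scaler is fit on the training rows only, then reused when predicting
scaled_svr = make_pipeline(StandardScaler(), SVR(kernel="rbf"))
scaled_svr.fit(X_tr, y_tr)

svr_pred = scaled_svr.predict(X_te)
svr_mae = mean_absolute_error(y_te, svr_pred)
print("MAE: ", svr_mae)
```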


**C.** Using the `township_test.csv`, predict the number of townhomes for each township.

In [28]:
#get our testing data
test_df = pd.read_csv('township_test.csv')

#handle categorical data
encode_col = 'municipality_type'
test_df['municipality_type_encoded'] = LabelEncoder().fit_transform(test_df['municipality_type'])
test_df = test_df[test_df.columns.drop(encode_col)]

RF.predict(test_df)

array([53.46, 47.29, 52.25, 75.04, 55.66, 55.4 , 67.93, 53.21, 49.73,
       35.4 , 40.31, 75.44, 50.13, 50.43, 33.51, 55.6 , 49.32, 41.63,
       77.02, 30.12, 44.19, 87.29, 51.83, 45.64, 45.39])

**Note**:

These predictions should be read as accurate to roughly +/- 6 townhomes on average, the mean absolute error (MAE) obtained in part B.
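
A caution on the encoding step in part C: a `LabelEncoder` fit separately on the test file only yields the same codes as the training one if both files contain the same set of municipality types. A minimal sketch, on made-up data, of fitting the encoder once and reusing it; the frames and variable names are hypothetical:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up frames: the test frame happens to contain no "rural" rows
train_demo = pd.DataFrame({"municipality_type": ["urban", "rural", "suburban", "rural"]})
test_demo = pd.DataFrame({"municipality_type": ["urban", "urban", "suburban"]})

# Fit once on the training column, then reuse the same mapping on the test column
encoder = LabelEncoder().fit(train_demo["municipality_type"])
test_codes = encoder.transform(test_demo["municipality_type"])

# Refitting on the test column alone silently produces a different mapping
refit_codes = LabelEncoder().fit_transform(test_demo["municipality_type"])

print(list(test_codes))   # codes consistent with training
print(list(refit_codes))  # codes shifted because "rural" is absent
```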

# Question 3: Business Insight
The results of your model need to be reported back to the contractor.

Using the information you have gathered from creating and validating the model as well as your own intuition, please answer the following questions:

**A.** What are some considerations to be made when determining a location for the construction of a new townhome community?

##### You could look at each county and see which townships have a high number of townhomes. If a high number of townhomes is a good sign (e.g. it indicates good conditions for townhomes), then that might be a good place to build. On the other hand, if a high number of townhomes indicates oversaturation, then a township with fewer townhomes than the model predicts may be an underserved market.
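
The "fewer than predicted" idea can be made concrete by ranking townships by the gap between predicted and actual counts. A minimal sketch with made-up numbers; `predicted_townhomes` is a hypothetical column, and on real data it should come from out-of-sample predictions, since a model scored on its own training rows will show misleadingly small gaps:

```python
import pandas as pd

# Made-up actual and predicted counts (illustration only)
demo = pd.DataFrame({
    "township_index": [0, 1, 2, 3],
    "count_townhomes": [56.0, 38.0, 25.0, 69.0],
    "predicted_townhomes": [50.0, 45.0, 40.0, 70.0],
})

# Positive shortfall: the model expects more townhomes than currently exist
demo["shortfall"] = demo["predicted_townhomes"] - demo["count_townhomes"]
underserved = demo.sort_values("shortfall", ascending=False)
print(underserved)
```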

**B.** Knowing that townhomes can be built at a variety of price and quality, what considerations would you have if you were required to recommend a townhome of a specific quality type to a location?

##### If your goal is to build homes of commensurate quality with those around them, then a model that predicts quality could be set up in a similar way to the process in Part 2, this time with a classifier. I think, though, that it would depend on much lower-level data, about individual streets, than is available at the township level. That data might also show where townships need more schools or have too many bars, for example.

**C.** What other information could the contractor have provided that could have enabled the construction of a more informative model?

##### Other statistics such as family size, information about school quality, and what kinds of businesses are in the area. With map data you could also build a computer vision model.

# Wrap-up

**Note**:
I haven't used a requirements.txt document before, so here is some material I found about it:

https://towardsdatascience.com/requirements-vs-setuptools-python-ae3ee66e28af#:~:text=txt%20file-,The%20requirements.,of%20dependencies%2C%20as%20discussed%20previously.

I also found a nice quality-of-life Python tool, session_info, that can get this information for you:

https://towardsdatascience.com/generating-a-requirements-file-jupyter-notebook-385f1c315b52#:~:text=The%20Pythonic%20Way%3A%20Pip%20Freeze&text=Simply%20open%20a%20terminal%20and,environment%20using%20venv%20or%20conda.&text=It%20will%20take%20every%20package%20you%20have%20installed%20on%20that%20environment.



In [6]:
#download session info in Jupyter
#!pip install session-info
#import session_info
#session_info.show()
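
As an alternative to `session_info`, the pinned lines for `requirements.txt` can be built from the standard library. A minimal sketch, assuming Python 3.8 or later; the package list is an assumption based on the imports used in this notebook:

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names for the libraries imported in this notebook (assumed list)
packages = ["pandas", "scikit-learn"]

lines = []
for name in packages:
    try:
        # Pin to whatever version is installed in the current environment
        lines.append(f"{name}=={version(name)}")
    except PackageNotFoundError:
        # Fall back to an unpinned entry if the package is not installed
        lines.append(name)

requirements_text = "\n".join(lines) + "\n"
print(requirements_text)
```

Writing `requirements_text` to a file named `requirements.txt` next to the notebook would then satisfy the `pip install -r requirements.txt` step.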