### Part 2

Load in your data set.


In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

%matplotlib inline


food1 = pd.read_csv('food_coded.csv')


Separate your target column from potential feature columns.


In [None]:
target_col = food1.loc[:, 'diet_current_coded']
feature_cols = food1.loc[:, ['exercise', 'weight', 'ideal_diet_coded', 'fav_food', 
                          'ethnic_food', 'on_off_campus', 'fav_cuisine_coded']]
food = food1.loc[:, ['diet_current_coded', 'exercise', 'weight', 'ideal_diet_coded',
                   'fav_food', 'ethnic_food', 'on_off_campus', 'fav_cuisine_coded']]
food.head()

Get those columns in a clean enough state that you can build a model with them. It is OK to be fast and loose at this stage, e.g. by simply dropping rows or columns that have missing values or have string values that would take some work to make usable.


In [None]:
food = food.dropna()
food.head()

In [None]:
food.weight.head()

In [None]:
food = food.drop(index=2)

In [None]:
food.head(3)

In [None]:
# Overwrite only this cell; DataFrame.replace would also change any other cell holding the same value.
food.loc[3, 'weight'] = 240


In [None]:
food.loc[61:71, 'weight']

In [None]:
food.head()

In [None]:
# Overwrite only this cell; DataFrame.replace would also change any other cell holding the same value.
food.loc[67, 'weight'] = 144


In [None]:
food.loc[67, 'weight']

In [None]:
food = food.astype({'weight': 'int64'})

In [None]:
food.weight.head()
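
The one-off weight fixes above could also be done in a single pass. The sketch below is only an illustration: the strings in `raw` are made up and may not match what the real `weight` column contains. It pulls the first run of digits out of each entry and converts the result to a number, leaving entries with no digits as `NaN` so they can be dropped later.

```python
import pandas as pd

# Hypothetical raw values; the real 'weight' column may contain different strings.
raw = pd.Series(["155", "Not sure, 240", "190", "144 lbs", "nan"], name="weight")

# Extract the first run of digits from each entry (NaN when there are none).
extracted = raw.str.extract(r"(\d+)", expand=False)

# Convert to numbers; anything that still cannot be parsed becomes NaN.
weight_clean = pd.to_numeric(extracted, errors="coerce")

print(weight_clean.tolist())
```

This keeps the cleaning rule in one place, at the cost of silently trusting the first number found in each string, so it is worth spot-checking the rows it changes.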

Do a train/test split.


In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
lr = LinearRegression()

In [None]:
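# reset_index returns a new DataFrame, so assign it back; drop=True discards the old index labels.
food = food.reset_index(drop=True)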

In [None]:
target_col = 'diet_current_coded'
feature_cols = ['exercise', 'weight', 'ideal_diet_coded', 'fav_food', 'ethnic_food', 'on_off_campus', 'fav_cuisine_coded']

In [None]:
food.diet_current_coded.value_counts()

In [None]:
X = food.loc[:, feature_cols]
y = food.loc[:, target_col]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=10)

Fit some kind of regression or classification model on your training set. Be sure to choose the correct type: regression if your target variable is a number, classification if it is a category.


In [None]:
lr.fit(X_train, y_train)

Use an appropriate metric to evaluate your model on both the training set and the test set. 

Be sure to choose a regression metric for a regression problem (e.g. MSE, RMSE, MAE, R-squared) or a classification metric for a classification problem (e.g. accuracy, $F_1$).


In [None]:
y_pred = lr.predict(X_test)

In [None]:
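from sklearn import metrics
# Report MAE and R^2 on the training set (first line) and the test set (second line).
print(metrics.mean_absolute_error(y_train, lr.predict(X_train)), lr.score(X_train, y_train))
print(metrics.mean_absolute_error(y_test, y_pred), lr.score(X_test, y_test))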

Compare your model's performance to that of a null model, e.g. by calculating R^2 for a regression model, comparing accuracy for a classification model to the frequency of the most common class, or calculating the same metric for your model and for a set of predictions that is simply the average value of the target variable for a regression model or the most common class for a classification model.
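
One way to make the null-model comparison concrete is scikit-learn's `DummyRegressor`, which always predicts the mean of the training target. The sketch below uses synthetic stand-in data (`X_demo`, `y_demo` are assumptions, not the food survey) and computes the same metric for the null model and for a fitted model.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption, not the food survey).
rng = np.random.default_rng(10)
X_demo = rng.normal(size=(200, 3))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(scale=0.5, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=10)

# The null model ignores the features and predicts the training mean.
null_model = DummyRegressor(strategy="mean").fit(X_tr, y_tr)
model = LinearRegression().fit(X_tr, y_tr)

null_mae = mean_absolute_error(y_te, null_model.predict(X_te))
model_mae = mean_absolute_error(y_te, model.predict(X_te))
print(f"null MAE:  {null_mae:.3f}")
print(f"model MAE: {model_mae:.3f}")
```

If the fitted model's error is not clearly below the null model's, the features are adding little, which is what an R^2 near zero also indicates.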


State whether this first-pass model appears to be overfitting or underfitting.


In [None]:
fig, ax = plt.subplots(figsize=(8, 4))
residuals = y_test - y_pred
ax.scatter(x=X_test.index, y=residuals, alpha=.1);

In [None]:
print(lr.intercept_)
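# Pair each coefficient with its feature; food.columns starts with the target, which would shift every label by one.
print(list(zip(X_train.columns, lr.coef_)))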

In [None]:
compare_to_actual = list(zip(lr.predict(X_test), y_test))
compare_to_actual

On first pass, the model seems to be underfitting.

## For each variable individually (including categorical variables):
 Look at the descriptive statistics.
 
 Visualize the distribution.
 
 Note which variables appear to be roughly normally distributed and which appear to be strongly skewed, as well as any other potentially important observations.  
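
As a rough numeric check to go with the plots, `DataFrame.skew()` gives a per-column skewness. The sketch below uses made-up columns (`roughly_normal`, `right_skewed`) rather than the survey data; values near 0 suggest a roughly symmetric distribution, while larger absolute values suggest skew.

```python
import numpy as np
import pandas as pd

# Made-up columns for illustration only.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "roughly_normal": rng.normal(loc=3, scale=1, size=500),
    "right_skewed": rng.exponential(scale=1.0, size=500),
})

# Sample skewness per column: near 0 is symmetric, positive means a long right tail.
skewness = demo.skew()
print(skewness.round(2))

# Histograms give the visual counterpart (uncomment in a notebook):
# demo.hist(bins=30, figsize=(10, 4))
```

For coded categorical columns, `value_counts()` is usually more informative than skewness, since the codes have no natural spacing.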


In [None]:
food_counts = food.describe()
food_counts

In [None]:
food2 = food.drop('weight', axis=1)

# weight has much larger values than every other column, which makes the rest hard to read.
# I might bin it into categories later.
food2.describe()

In [None]:
food_counts = food2.describe()
food_counts = food_counts.drop('count') #this isn't a meaningful metric for this exercise

In [None]:
food_counts.plot(kind="bar",figsize=(20,20))

In [None]:
fig, ax = plt.subplots(figsize=(15,10))
sns.scatterplot(data=food_counts, ax=ax);

From the descriptive statistics and the visualizations, ethnic_food seems to be somewhat skewed, and fav_food a little less so. The standard deviations of the other columns do not look unusual.

I'm not getting a lot out of the current feature columns, so I want to take a quick look at the whole dataframe.

In [None]:
all_counts = food1.describe()
all_counts

In [None]:
axes = food1.plot(figsize=(20,20))

In [None]:
axes = food1.plot(kind='bar',figsize=(20,20))

## For each potential feature variable:

  Measure its correlation with the target variable.
  
  Visualize its relationship with the target variable.
  
  Note which feature variables appear to be roughly linearly related to the target, related to it but not linearly, and unrelated to it, as well as any other potentially important observations.
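
A compact way to measure every feature's correlation with the target at once is `DataFrame.corrwith`. The sketch below uses synthetic columns (`linear_feat`, `curved_feat`, `noise_feat` are hypothetical names) to show why it helps to look at a rank-based coefficient and a scatter plot alongside Pearson: a feature can be related to the target without being linearly related to it.

```python
import numpy as np
import pandas as pd

# Synthetic data for illustration only.
rng = np.random.default_rng(1)
n = 300
demo = pd.DataFrame({
    "linear_feat": rng.normal(size=n),
    "curved_feat": rng.normal(size=n),
    "noise_feat": rng.normal(size=n),
})
# The target depends linearly on one feature and quadratically on another.
demo["target"] = (demo["linear_feat"] + demo["curved_feat"] ** 2
                  + rng.normal(scale=0.3, size=n))

features = demo.drop(columns="target")

# Pearson measures linear association; Spearman measures monotonic association.
pearson = features.corrwith(demo["target"])
spearman = features.corrwith(demo["target"], method="spearman")

print(pd.DataFrame({"pearson": pearson, "spearman": spearman}).round(2))
```

Neither coefficient is designed to pick up a symmetric, U-shaped relationship like the one built into `curved_feat`, so a low value there does not by itself mean the feature is unrelated to the target.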


In [None]:
def find_corr(feature_cols, feature):
    """Fit a one-feature linear model and plot the feature against the target."""
    X = food[feature_cols]
    y = food['diet_current_coded']

    lr_feat = LinearRegression()
    lr_feat.fit(X, y)

    # Pair each coefficient with the feature it belongs to, then report the Pearson correlation.
    print(list(zip(feature_cols, lr_feat.coef_)))
    print(food[feature].corr(y))
    ax = food.boxplot('diet_current_coded', by=feature_cols,
                      figsize=(10, 5))
    axes = food.loc[:, [feature, 'diet_current_coded']].plot(kind='scatter', x=feature, y='diet_current_coded', alpha=.5, figsize=(20, 10))

find_corr(['weight'], 'weight')

In [None]:
find_corr(['ideal_diet_coded'], 'ideal_diet_coded')

In [None]:
find_corr(['fav_food'],'fav_food')    

In [None]:
find_corr(['ethnic_food'],'ethnic_food')

In [None]:
find_corr(['on_off_campus'], 'on_off_campus')

In [None]:
sns.heatmap(food.corr(),
            vmin=-1,
            vmax=1,
            cmap=sns.diverging_palette(220, 10, n=21),
            );

In [None]:
fig, ax = plt.subplots(figsize=(20,20))
sns.heatmap(food1.corr(numeric_only=True),  # skip the string columns in the raw data
            vmin=-1,
            vmax=1,
            cmap=sns.diverging_palette(220, 10, n=21),
            ax=ax,
            );

Without any further data cleaning, nothing in this data set appears to be strongly correlated with the target variable. The only strong correlations that show up are between ethnic_food and other columns that fall under the same category, where students were asked how likely they are to eat that type of food. I'm going to need to inspect this data set in finer detail. Some columns have outliers that may be throwing off the correlation.

A big limitation of this data is that there doesn't seem to be much correlation between the feature columns used so far and the target variable. There may simply not be much variation in the data itself; that is, diet_current_coded may have been grouped too liberally, without enough nuance. It is also possible that a wide variety of students simply eat very similarly.

I will have to try different feature columns.

I need to evaluate the original answers versus the coded ones.

I will most likely need to use the non-numeric columns, but I may need to split some where the answers are lists, and I might need to figure out how to do some natural language processing. 
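
Short of full natural language processing, list-style answers can often be split into indicator columns. The sketch below uses a hypothetical free-text column (`free_text_demo`, with made-up answers) and `Series.str.get_dummies` to produce one 0/1 column per distinct item.

```python
import pandas as pd

# Made-up answers; the real free-text columns may be messier than this.
answers = pd.Series(
    ["Pizza, pasta", "pasta", "ice cream,  pizza", None],
    name="free_text_demo",
)

# Normalize case and the whitespace around commas before splitting.
cleaned = (answers.fillna("")
                  .str.lower()
                  .str.replace(r"\s*,\s*", ",", regex=True)
                  .str.strip())

# One indicator column per distinct item; rows with no answer are all zeros.
indicators = cleaned.str.get_dummies(sep=",")
print(indicators)
```

This only works when respondents use a consistent separator and spelling, so it is a first pass rather than a substitute for reading the raw answers.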

I may also need to change some columns to boolean.

I can also adjust the sensitivity of my model.

I may also need to create a column that gives squares of another column.
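
To illustrate the squared-column idea, the sketch below fits a linear model with and without a squared term on synthetic data (`x` and `y` are made up, with a deliberately quadratic relationship). It is only meant to show the mechanics, not to suggest that any survey column behaves this way.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data with a quadratic relationship (an assumption for illustration).
rng = np.random.default_rng(2)
demo = pd.DataFrame({"x": rng.uniform(-3, 3, size=200)})
demo["y"] = demo["x"] ** 2 + rng.normal(scale=0.5, size=200)

# New column holding the square of an existing column.
demo["x_sq"] = demo["x"] ** 2

r2_linear = LinearRegression().fit(demo[["x"]], demo["y"]).score(demo[["x"]], demo["y"])
r2_squared = (LinearRegression()
              .fit(demo[["x", "x_sq"]], demo["y"])
              .score(demo[["x", "x_sq"]], demo["y"]))

print(f"R^2 with x only:     {r2_linear:.3f}")
print(f"R^2 with x and x^2:  {r2_squared:.3f}")
```

These scores are computed on the same data the models were fit on, so they show fit rather than generalization; a train/test split would still be needed to judge the benefit.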

Mean Absolute Error is a better regression metric to use for this data, because a rare instance of a prediction off by a large margin is not that detrimental, but the majority of the predictions should have a small margin of error. 
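
A small worked example shows how the two metrics react to one large miss. The values below are made up: four predictions are off by 0.5 and one is off by 3.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up values: four errors of 0.5 and one error of 3.0.
y_true = np.array([1.0, 2.0, 2.0, 3.0, 1.0])
y_pred_demo = np.array([1.5, 2.5, 1.5, 2.5, 4.0])

# MAE averages the absolute errors: (0.5 * 4 + 3.0) / 5 = 1.0
mae = mean_absolute_error(y_true, y_pred_demo)

# RMSE squares errors first: sqrt((0.25 * 4 + 9.0) / 5) = sqrt(2), about 1.414
rmse = np.sqrt(mean_squared_error(y_true, y_pred_demo))

print(f"MAE:  {mae:.3f}")
print(f"RMSE: {rmse:.3f}")
```

Because RMSE squares each error before averaging, the single large miss pulls it up more than it pulls up MAE, which matches the reasoning above for preferring MAE here.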