Lambda School Data Science

*Unit 2, Sprint 3, Module 1*

---


# Define ML problems

You will use your portfolio project dataset for all assignments this sprint.

## Assignment

Complete these tasks for your project, and document your decisions.

- [ ] Choose your target. Which column in your tabular dataset will you predict?
- [ ] Is your problem regression or classification?
- [ ] How is your target distributed?
    - Classification: How many classes? Are the classes imbalanced?
    - Regression: Is the target right-skewed? If so, you may want to log transform the target.
- [ ] Choose your evaluation metric(s).
    - Classification: Is your majority class frequency at least 50% and below 70%? If so, accuracy alone is a reasonable metric. Outside that range, accuracy can be misleading. What evaluation metric will you choose, in addition to or instead of accuracy?
    - Regression: Will you use mean absolute error, root mean squared error, R^2, or other regression metrics?
- [ ] Choose which observations you will use to train, validate, and test your model.
    - Are some observations outliers? Will you exclude them?
    - Will you do a random split or a time-based split?
- [ ] Begin to clean and explore your data.
- [ ] Begin to choose which features, if any, to exclude. Would some features "leak" future information?
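
The class-balance check in the list above can be sketched in a few lines. This is a minimal illustration on a made-up target, not part of the assignment itself; the helper names `majority_class_frequency` and `accuracy_alone_is_reasonable` and the toy series `y_toy` are hypothetical.

```python
import pandas as pd

def majority_class_frequency(y: pd.Series) -> float:
    """Return the share of observations that belong to the most common class."""
    # value_counts sorts by frequency (descending), so the first entry is the majority class
    return float(y.value_counts(normalize=True).iloc[0])

def accuracy_alone_is_reasonable(freq: float) -> bool:
    """Rule of thumb from the checklist: 50% <= majority frequency < 70%."""
    return bool(0.50 <= freq < 0.70)

# Hypothetical toy target: 6 negatives and 2 positives
y_toy = pd.Series([False, False, False, True, False, False, True, False])

freq = majority_class_frequency(y_toy)
print(freq, accuracy_alone_is_reasonable(freq))  # 0.75 False
```

When the check returns `False`, as it does here, a metric such as ROC AUC, precision, or recall is likely a better companion to accuracy.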

If you haven't found a dataset yet, do that today. [Review requirements for your portfolio project](https://lambdaschool.github.io/ds/unit2) and choose your dataset.

Some students worry, ***what if my model isn't “good”?*** Then, [produce a detailed tribute to your wrongness. That is science!](https://twitter.com/nathanwpyle/status/1176860147223867393)

In [8]:
from google.colab import files

uploaded = files.upload()

Saving Video_Games_Sales_as_at_22_Dec_2016.csv to Video_Games_Sales_as_at_22_Dec_2016.csv


In [0]:
import pandas as pd

df = pd.read_csv('Video_Games_Sales_as_at_22_Dec_2016.csv')

In [102]:
df.head()

Unnamed: 0,Name,Platform,Year_of_Release,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales,Critic_Score,Critic_Count,User_Score,User_Count,Developer,Rating
0,Wii Sports,Wii,2006.0,Sports,Nintendo,41.36,28.96,3.77,8.45,82.53,76.0,51.0,8.0,322.0,Nintendo,E
1,Super Mario Bros.,NES,1985.0,Platform,Nintendo,29.08,3.58,6.81,0.77,40.24,,,,,,
2,Mario Kart Wii,Wii,2008.0,Racing,Nintendo,15.68,12.76,3.79,3.29,35.52,82.0,73.0,8.3,709.0,Nintendo,E
3,Wii Sports Resort,Wii,2009.0,Sports,Nintendo,15.61,10.93,3.28,2.95,32.77,80.0,73.0,8.0,192.0,Nintendo,E
4,Pokemon Red/Pokemon Blue,GB,1996.0,Role-Playing,Nintendo,11.27,8.89,10.22,1.0,31.37,,,,,,


In [103]:
df['Critic_Score'].describe()

count    8137.000000
mean       68.967679
std        13.938165
min        13.000000
25%        60.000000
50%        71.000000
75%        79.000000
max        98.000000
Name: Critic_Score, dtype: float64

In [104]:
df['Critic_Score'].isna().sum()

8582

In [0]:
# .copy() avoids a SettingWithCopyWarning when new columns are added below
df = df.dropna(subset=['Critic_Score']).copy()

In [0]:
df['High_Critic_Score'] = df['Critic_Score'] >= 80

In [107]:
df['High_Critic_Score'].value_counts(normalize=True)

False    0.752734
True     0.247266
Name: High_Critic_Score, dtype: float64

In [108]:
df.isna().sum()

Name                    0
Platform                0
Year_of_Release       154
Genre                   0
Publisher               4
NA_Sales                0
EU_Sales                0
JP_Sales                0
Other_Sales             0
Global_Sales            0
Critic_Score            0
Critic_Count            0
User_Score             38
User_Count           1120
Developer               6
Rating                 83
High_Critic_Score       0
dtype: int64

In [0]:
import numpy as np

def correct_user_score(score):
  # 'tbd' marks a score that has not been determined yet; treat it as missing
  if score == 'tbd':
    return np.nan
  else:
    return float(score)

In [0]:
df['User_Score'] = df['User_Score'].apply(correct_user_score)
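
As an alternative to a row-wise `apply`, `pd.to_numeric` with `errors='coerce'` does the same conversion in one vectorized call. The sketch below runs on a made-up series (`raw_scores` is hypothetical), and it assumes `'tbd'` is the only non-numeric value you want to treat as missing, because `coerce` turns every unparseable string into `NaN`.

```python
import pandas as pd

# Hypothetical toy column mixing numeric strings, 'tbd', and a missing value
raw_scores = pd.Series(['8.0', 'tbd', '7.5', None])

# Anything that cannot be parsed as a number becomes NaN
clean_scores = pd.to_numeric(raw_scores, errors='coerce')

print(clean_scores.dtype)         # float64
print(clean_scores.isna().sum())  # 2
```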

In [0]:
top_25_publishers = df['Publisher'].value_counts(ascending=False)[:25].index

def publisher_top_25(publisher):
  if publisher in top_25_publishers:
    return publisher
  else:
    return "Other"

In [0]:
df['Publisher'] = df['Publisher'].apply(publisher_top_25)

In [0]:
df['Developer'] = df['Developer'].fillna("Missing")

In [0]:
# Group developer names containing "EA " or "Ubisoft" under one label each
ea = df['Developer'].str.contains("EA ")
ubisoft = df['Developer'].str.contains("Ubisoft")

df.loc[ea, 'Developer'] = "Electronic Arts"
df.loc[ubisoft, 'Developer'] = 'Ubisoft'

In [0]:
top_25_developers = df['Developer'].value_counts(ascending=False)[:25].index

def developer_top_25(developer):
  if developer in top_25_developers:
    return developer
  else:
    return "Other"


df['Developer'] = df['Developer'].apply(developer_top_25)
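
The publisher and developer cells repeat the same "keep the top N, relabel the rest" logic. One way to share it is a small helper like the sketch below; `keep_top_n` and the toy series are hypothetical names introduced for illustration, not functions from the notebook or from pandas.

```python
import pandas as pd

def keep_top_n(series: pd.Series, n: int, other: str = "Other") -> pd.Series:
    """Keep the n most frequent values and replace everything else with `other`."""
    top = series.value_counts().index[:n]
    # where() keeps values for which the condition is True and fills the rest
    return series.where(series.isin(top), other)

# Hypothetical toy column: A appears 3 times, B twice, C and D once each
toy = pd.Series(["A", "A", "A", "B", "B", "C", "D"])

print(keep_top_n(toy, n=2).tolist())
# ['A', 'A', 'A', 'B', 'B', 'Other', 'Other']
```

With a helper like this, each column would need a single call, for example `keep_top_n(df['Publisher'], 25)`. Missing values are not in the top list, so they are also relabeled as `"Other"`.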

In [117]:
df['Year_of_Release'].value_counts()

2008.0    715
2007.0    692
2005.0    655
2009.0    651
2002.0    627
2006.0    620
2003.0    585
2004.0    561
2011.0    500
2010.0    500
2001.0    326
2012.0    321
2013.0    273
2014.0    261
2016.0    232
2015.0    225
2000.0    143
1999.0     39
1998.0     28
1997.0     17
1996.0      8
1994.0      1
1985.0      1
1992.0      1
1988.0      1
Name: Year_of_Release, dtype: int64

In [0]:
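# Time-based split: 2015 for validation, 2016 for test, everything else for training.
# Rows with a missing Year_of_Release pass both != checks, so they land in train.
train = df[(df['Year_of_Release'] != 2015) & (df['Year_of_Release'] != 2016)]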
val = df[df['Year_of_Release']==2015]
test = df[df['Year_of_Release']==2016]

In [119]:
train.shape, val.shape, test.shape

((7680, 17), (225, 17), (232, 17))
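
One detail of the split is worth checking: comparisons against `NaN` behave asymmetrically, so a `!=` mask keeps rows with an unknown release year while a `<` mask drops them. The sketch below shows the difference on a made-up frame (`toy` is hypothetical). Which behavior is appropriate depends on whether you trust those rows to predate the validation year.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one unknown release year
toy = pd.DataFrame({'Year_of_Release': [2013.0, 2014.0, 2015.0, 2016.0, np.nan]})

# NaN != 2015 evaluates to True, so the row with the unknown year is kept
train_ne = toy[(toy['Year_of_Release'] != 2015) & (toy['Year_of_Release'] != 2016)]

# NaN < 2015 evaluates to False, so the row with the unknown year is excluded
train_lt = toy[toy['Year_of_Release'] < 2015]

print(len(train_ne), len(train_lt))  # 3 2
```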

In [121]:
train.head(1)

Unnamed: 0,Name,Platform,Year_of_Release,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales,Critic_Score,Critic_Count,User_Score,User_Count,Developer,Rating,High_Critic_Score
0,Wii Sports,Wii,2006.0,Sports,Nintendo,41.36,28.96,3.77,8.45,82.53,76.0,51.0,8.0,322.0,Nintendo,E,False


In [0]:
!pip install category_encoders==2.*

In [0]:
target = 'High_Critic_Score'

X_train = train.drop(columns=['Name', 'Critic_Score', target])
y_train = train[target]

X_val = val.drop(columns=['Name', 'Critic_Score', target])
y_val = val[target]


In [127]:
import category_encoders as ce
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier


pipeline = make_pipeline(
    ce.OneHotEncoder(use_cat_names=True),
    SimpleImputer(),
    RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=10)
)


pipeline.fit(X_train, y_train)
print(f'Val score (acc): {pipeline.score(X_val, y_val)}')

Val score (acc): 0.7822222222222223
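
Because about 75% of the rows are `False`, an accuracy number is easier to read next to a majority-class baseline, that is, the accuracy of always predicting the most common training class. The sketch below computes that baseline on made-up labels. `majority_baseline_accuracy` and the toy series are hypothetical; in the notebook you would pass `y_train` and `y_val` instead.

```python
import pandas as pd

def majority_baseline_accuracy(y_train: pd.Series, y_val: pd.Series) -> float:
    """Accuracy on y_val from always predicting the most common class in y_train."""
    majority_class = y_train.mode()[0]
    return float((y_val == majority_class).mean())

# Hypothetical toy labels
y_train_toy = pd.Series([False, False, False, True])
y_val_toy = pd.Series([False, True, False, False, True])

baseline = majority_baseline_accuracy(y_train_toy, y_val_toy)
print(baseline)  # 0.6
```

A model is only adding value on this metric if its validation accuracy clearly exceeds the baseline for the same validation set.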
