Lambda School Data Science

*Unit 2, Sprint 3, Module 1*

---


# Define ML problems

You will use your portfolio project dataset for all assignments this sprint.

## Assignment

Complete these tasks for your project, and document your decisions.

- [X] Choose your target. Which column in your tabular dataset will you predict?
- [X] Is your problem regression or classification?
- [X] How is your target distributed?
    - Classification: How many classes? Are the classes imbalanced?
    - Regression: Is the target right-skewed? If so, you may want to log transform the target.
- [X] Choose which observations you will use to train, validate, and test your model.
    - Are some observations outliers? Will you exclude them?
    - Will you do a random split or a time-based split?
- [X] Choose your evaluation metric(s).
    - Classification: Is your majority class frequency > 50% and < 70% ? If so, you can just use accuracy if you want. Outside that range, accuracy could be misleading. What evaluation metric will you choose, in addition to or instead of accuracy?
- [X] Begin to clean and explore your data.
- [X] Begin to choose which features, if any, to exclude. Would some features "leak" future information?

In [20]:
# The goal of this notebook is to estimate how likely an Early Access game is to
# leave Early Access.

import pandas as pd
import pandas_profiling as pdp


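def google_drive_useable(link):
    # Turn a Google Drive share link into a direct-download link that pd.read_csv can fetch.
    return link.replace('/open?', '/uc?')

data = r'https://drive.google.com/open?id=1pEOgqOZcgxwu7gA4GnFuMwO_wVBLQ7cf'
df   = pd.read_csv(google_drive_useable(data))
# Render the pandas_profiling overview report for the whole frame.
df.profile_report(style = {'full_width': True})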



In [18]:
df.head()

Unnamed: 0,QueryName,ReleaseDate,RequiredAge,DemoCount,DeveloperCount,DLCCount,Metacritic,MovieCount,PackageCount,RecommendationCount,...,GenreIsSports,GenreIsRacing,GenreIsMassivelyMultiplayer,GenreIsEarlyAccess,ReleaseType,PriceInitial,PriceFinal,AboutText,HeaderImage,SupportedLanguages
0,PLAYERUNKNOWN'S BATTLEGROUNDS,Nov 1 2000,0,0,1,0,88,0,1,68991,...,False,False,False,False,Ex Early Access,9.99,9.99,Play the worlds number 1 online action game. E...,http://cdn.akamai.steamstatic.com/steam/apps/1...,English French German Italian Spanish Simplifi...
1,Unturned,Apr 1 1999,0,0,1,0,0,0,1,2439,...,False,False,False,False,Ex Early Access,4.99,4.99,One of the most popular online action games of...,http://cdn.akamai.steamstatic.com/steam/apps/2...,English French German Italian Spanish
2,Wallpaper Engine,May 1 2003,0,0,1,0,79,0,1,2319,...,False,False,False,False,Ex Early Access,4.99,4.99,Enlist in an intense brand of Axis vs. Allied ...,http://cdn.akamai.steamstatic.com/steam/apps/3...,English French German Italian Spanish
3,Don't Starve Together,Jun 1 2001,0,0,1,0,0,0,1,888,...,False,False,False,False,Ex Early Access,4.99,4.99,Enjoy fast-paced multiplayer gaming with Death...,http://cdn.akamai.steamstatic.com/steam/apps/4...,English French German Italian Spanish
4,Rust,Nov 1 1999,0,0,1,0,0,0,1,2934,...,False,False,False,False,Ex Early Access,4.99,4.99,Return to the Black Mesa Research Facility as ...,http://cdn.akamai.steamstatic.com/steam/apps/5...,English French German Korean


In [22]:
# - [X] Choose your target. Which column in your tabular dataset will you predict?
target = 'ReleaseType'

In [None]:
# - [X] Is your problem regression or classification?

# Classification

In [None]:
# - [X] How is your target distributed?
#     - Classification: How many classes? Are the classes imbalanced?
#     - Regression: Is the target right-skewed? If so, you may want to log transform the target.

# 3 classes; Traditional Release, Early Access, Ex Early Access
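
To answer the imbalance half of the question, the per-class counts and shares are worth looking at directly. The sketch below uses a small made-up frame (`toy`, with invented counts) in place of the real `df`; only the `ReleaseType` column name and the three class labels come from this notebook.

```python
import pandas as pd

# Hypothetical stand-in for the notebook's `df`; the counts are invented for illustration.
toy = pd.DataFrame({
    "ReleaseType": ["Traditional Release"] * 6
                 + ["Early Access"] * 3
                 + ["Ex Early Access"] * 1
})

counts = toy["ReleaseType"].value_counts()                # rows per class, most frequent first
shares = toy["ReleaseType"].value_counts(normalize=True)  # fraction of rows per class

print(counts)
print(shares.round(2))
```

Running the same two calls on `df[target]` should show how far the real classes are from balanced.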

In [None]:
# - [X] Choose which observations you will use to train, validate, and test your model.
#     - Are some observations outliers? Will you exclude them?
#     - Will you do a random split or a time-based split?

# I may need to filter my data more aggressively than I'd like. Some products in this list are free,
# so they might not be applicable, though maybe they are.

# Only about 670 of the 13,000 rows are actually Ex Early Access.
# I may need to pull them out, then split them evenly between the sets.
# Otherwise I might need to check most of the games by hand and filter them down further.
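
One possible alternative to pulling the rare class out by hand is a stratified random split, which keeps each class's share roughly equal across train, validation, and test. This is a sketch under assumptions, not the split this project settled on: the toy frame, the 60/20/20 proportions, and `random_state=42` are all invented for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 70 rows of one class, 30 of the rarer one.
toy = pd.DataFrame({
    "PriceInitial": range(100),
    "ReleaseType": ["Early Access"] * 70 + ["Ex Early Access"] * 30,
})

# First carve off 40% of the rows, stratified on the target.
train, rest = train_test_split(
    toy, test_size=0.4, stratify=toy["ReleaseType"], random_state=42
)
# Then split that 40% in half to get validation and test sets.
val, test = train_test_split(
    rest, test_size=0.5, stratify=rest["ReleaseType"], random_state=42
)

for name, part in [("train", train), ("val", val), ("test", test)]:
    rare_share = (part["ReleaseType"] == "Ex Early Access").mean()
    print(name, len(part), round(rare_share, 2))
```

If the data turns out to need a time-based split instead, stratification does not apply in the same way and the class shares would have to be checked per time window.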

In [33]:
# - [ ] Choose your evaluation metric(s).
# Classification: Is your majority class frequency > 50% and < 70% ?
# If so, you can just use accuracy if you want. Outside that range, accuracy could be misleading.
# What evaluation metric will you choose, in addition to or instead of accuracy?

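# I need to determine a way to tell if a game is actually going to release, but
# as is I can just compare the released ones against the unreleased ones.

# value_counts() sorts classes by frequency, so .iloc picks by position:
# this is the ratio of the third most common class to the second most common.
types    = df['ReleaseType'].value_counts()
baseline = types.iloc[2] / types.iloc[1]
baseline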

0.4564032697547684
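
The assignment's question is about the majority class frequency, which is a different number from the class-to-class ratio above. A minimal sketch of that check follows, again on an invented toy target rather than the real `df['ReleaseType']`; the names `toy_target`, `majority_baseline`, and `use_accuracy_alone` are hypothetical.

```python
import pandas as pd

# Hypothetical toy target; the class counts are invented for illustration.
toy_target = pd.Series(
    ["Traditional Release"] * 6 + ["Early Access"] * 3 + ["Ex Early Access"] * 1
)

shares = toy_target.value_counts(normalize=True)

majority_class = shares.idxmax()   # label of the most frequent class
majority_baseline = shares.max()   # accuracy of always predicting that label

# The assignment's rule of thumb: accuracy alone is reasonable only in this range.
use_accuracy_alone = 0.5 < majority_baseline < 0.7

print(majority_class, majority_baseline, use_accuracy_alone)
```

If the real majority share falls outside that range, a metric that accounts for the rare class (for example per-class precision and recall) is probably a better fit than accuracy alone.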

In [None]:
# - [X] Begin to clean and explore your data.

# Already on it.

In [26]:
# - [X] Begin to choose which features, if any, to exclude. Would some features "leak" future information?

features = ['Metacritic'
           ,'ReleaseDate'
           
           ,'RecommendationCount'
           ,'DeveloperCount'
           ,'PublisherCount'
           ,'DLCCount'
           ,'ScreenshotCount'
           
           ,'CategorySinglePlayer'
           ,'CategoryMultiplayer'
           ,'CategoryCoop'
           ,'CategoryMMO'
           ,'CategoryInAppPurchase'
           ,'CategoryIncludeSrcSDK'
           ,'CategoryIncludeLevelEditor'
           ,'CategoryVRSupport'
           
           ,'GenreIsIndie'
           ,'GenreIsAction'
           ,'GenreIsAdventure'
           ,'GenreIsCasual'
           ,'GenreIsStrategy'
           ,'GenreIsRPG'
           ,'GenreIsSimulation'
           ,'GenreIsSports'
           ,'GenreIsRacing'
           ,'GenreIsMassivelyMultiplayer'
           
           ,'PlatformWindows'
           ,'PlatformLinux'
           ,'PlatformMac'
           
           ,'PriceInitial'
           ,'PriceFinal'
           ]
df.columns

# Might need to remove:
# - traditional games
# - subscription games
# - free to play
# - 'Non Game'

Index(['QueryName', 'ReleaseDate', 'RequiredAge', 'DemoCount',
       'DeveloperCount', 'DLCCount', 'Metacritic', 'MovieCount',
       'PackageCount', 'RecommendationCount', 'PublisherCount',
       'ScreenshotCount', 'SteamSpyOwners', 'SteamSpyOwnersVariance',
       'SteamSpyPlayersEstimate', 'SteamSpyPlayersVariance',
       'AchievementCount', 'AchievementHighlightedCount', 'ControllerSupport',
       'IsFree', 'FreeVerAvail', 'PurchaseAvail', 'SubscriptionAvail',
       'PlatformWindows', 'PlatformLinux', 'PlatformMac', 'PCReqsHaveMin',
       'PCReqsHaveRec', 'LinuxReqsHaveMin', 'LinuxReqsHaveRec',
       'MacReqsHaveMin', 'MacReqsHaveRec', 'CategorySinglePlayer',
       'CategoryMultiplayer', 'CategoryCoop', 'CategoryMMO',
       'CategoryInAppPurchase', 'CategoryIncludeSrcSDK',
       'CategoryIncludeLevelEditor', 'CategoryVRSupport', 'GenreIsNonGame',
       'GenreIsIndie', 'GenreIsAction', 'GenreIsAdventure', 'GenreIsCasual',
       'GenreIsStrategy', 'GenreIsRPG', 'GenreIsSi