# Import Data

<br>
<br>

Below are brief explanations of what data we are importing and what these variables represent
* `ALL_DATA` _(data frame)_ - is our main data set 
* `poke_types` _(data frame)_ - is a seperate data set that contains a mapping of Pokemon IDs to pokemon name and types. This will be important as our main dependant variable will be **Pokemon type**
* `pokemonId_ALL` _(list/array)_ - of all unique pokemon IDs that exist in `ALL_DATA` from smallest to biggest number. 

<br>
<br>
<br>
<br>


In [1]:
import pandas as pd # data frames
import numpy as np # ____number generation
import statsmodels.formula.api as smf # for linear modeling
import matplotlib.pyplot as plt # plotting

import os.path # check if data file already exists
import json # save dictionary strucs

In [2]:
ALL_DATA = pd.read_csv('data/300k.csv')

  interactivity=interactivity, compiler=compiler, result=result)


In [3]:
poke_types = pd.read_csv('data/pokeId.csv')
poke_types = poke_types[['#', "Name", "Type 1", "Type 2"]]
pokemonId_ALL = set(ALL_DATA.pokemonId)

<br>
<br>
<br>

## Small Helper Functions

In [5]:
def printAllFeatures():
    cols = ALL_DATA.columns
    for col in cols:
        print(col)

<br>
<br>
<br>

## Prepare Data

In [6]:
# Create a mapping of IDs to type, secondary type, and name. We need to do this in order to add the
# appropriate type to each pokemon in our main data set.

pokeId_toType = {}

if not os.path.exists("data/pokeId_toType.json"): 
    
    print("Looks like you don't have the Pokemon ID to pokemon Type+Name dictionary.\nLet me go build that for you.\nStarting... ")
    
    # build dictionary and save it for future uses
    for index, row in poke_types.iterrows():
        if row["#"] in pokemonId_ALL: #only add poke that are in main data set
            pokeId_toType[str(row["#"])] = [row["Name"] ,row["Type 1"], row["Type 2"]]
    
    print("Saving dictionary locally for future uses.")
    with open("data/pokeId_toType.json", 'w') as f:
        json.dump(pokeId_toType, f)
    print("...Done")
else:
    print("Good, you already have the Poke ID to Poke Type+Name dict. Move on.")    
    with open("data/pokeId_toType.json") as f:
        pokeId_toType = json.load(f)

Good, you already have the Poke ID to Poke Type+Name dict. Move on.


In [7]:
# Check if you have the merged data set. If you do not then build and save it locally.
# The print statements do a good job of informing what's going on.

if not os.path.exists("data/merged.csv"): 
    
    print("Hey looks like you're missing the merged data, let me build that for you, it'll take 2-5 minutes probably.\n")
    
    # Add the new columns
    ALL_DATA["Name"] = ""
    ALL_DATA["Type"] = "" 

    def update_row(row):
        tempTypes = pokeId_toType[str(row["pokemonId"])]   
        listy = [tempTypes[0], tempTypes[1]]
        return pd.Series(listy)

    print("Merging data...")
    ALL_DATA[['Name', 'Type']] = ALL_DATA.apply(update_row, axis=1)
    print("Done merging data.\n")

    print("Saving merge data set to ./data/merged.csv")
    ALL_DATA.to_csv("data/merged.csv",index=False)
    print("\n...Done saving data. Move on now.\n")
    ALL_DATA.head()
else:
    ALL_DATA = pd.read_csv('data/merged.csv')
    print("Good job, you already have the merged data. Move on.")

Good job, you already have the merged data. Move on.


<br>
<br>
<br>

### Encoding

We cannot use categorical data (strings) in feature selection or regressions.
The solution to this is to turn strings into numerical data mappings in order to be able to process them.
Based on my research there are three beginner friendly functions that do this. 

`sklearn` has the functions `LabelEncoder()` and `OneHotEncoder()` -- and `pandas` has its own functions `get_dummies()`.

I selected the `get_dummies()` as both it and my data frames are managed with the `pandas` library.

In [8]:
DATA_encoded = ALL_DATA.dropna()

# These columns are of type objects but are not helpful to us.
DATA_encoded = DATA_encoded.drop(["appearedLocalTime", "appearedDayOfWeek", "_id", "city", "weatherIcon", "Name"], axis=1) ########## Question from Maggie - Do we want to keep name, would it help our results or nah? 

# it's an object column but should be float: convert it.
DATA_encoded["pokestopDistanceKm"] = DATA_encoded["pokestopDistanceKm"].apply(pd.to_numeric, downcast='float', errors='coerce')

In [9]:
# anything of type object seems to be categorical, encode it.

# produce dummies for categorical columns, concat them to data, remove originals

# get_dummies() code:
# https://towardsdatascience.com/the-dummys-guide-to-creating-dummy-variables-f21faddb1d40

cols_toNotTouch = ["Type"]

for col in DATA_encoded.columns:
    if (str(DATA_encoded[col].dtype) == 'object') and (not(col in cols_toNotTouch)):
        
        print("Encoding <" + col + ">.")
        dummy = pd.get_dummies(DATA_encoded[col])
        DATA_encoded = DATA_encoded.drop([col], axis=1)
        DATA_encoded = pd.concat([DATA_encoded, dummy], axis=1)
print("...Done encoding.")

Encoding <appearedTimeOfDay>.
Encoding <continent>.
Encoding <weather>.
...Done encoding.


#### ONLY ONE SPLIT

In [10]:
# Split data
from sklearn.model_selection import train_test_split

train_X, test_X, train_Y, test_Y = train_test_split(
                                       DATA_encoded,      # features
                                        DATA_encoded.Type,                               
#     ALL_DATA["Type"],    # outcome ######################### CHANGE TO ---> DATA_encoded.Type 
                                       test_size=0.20, # percentage of data to use as the test set
                                       random_state=15 # set a random state so it is consistent (not required!)
                                                                            )

print("train features shape", train_X.shape)
print("test features shape", test_X.shape)
print()
print("train outcomes shape", train_Y.shape)
print("test outcomes shape", test_Y.shape)

train features shape (236816, 242)
test features shape (59205, 242)

train outcomes shape (236816,)
test outcomes shape (59205,)


In [11]:
train_X = train_X.dropna()
train_Y = train_X["Type"]

test_X = train_X.dropna()
test_Y = train_X["Type"]

train_X_for_modeling = train_X.copy()
test_X_for_modeling = test_X.copy()


train_X_results = train_X.Type
test_X_results = test_X.Type


# Put this in if (has Type column)
train_X = train_X.drop(["Type"], axis=1)
test_X = test_X.drop(["Type"], axis=1)

<br>
<br>
<br>

## Feature Selection - REVISIT LATER --- CHANGE OLS TO CATEGORIAL

https://stackoverflow.com/questions/30384995/randomforestclassfier-fit-valueerror-could-not-convert-string-to-float

In [19]:
import statsmodels.formula.api as smf

# Code from https://planspace.org/20150423-forward_selection_with_statsmodels/
def forward_selected(data, response):
    """Linear model designed by forward selection.

    Parameters:
    -----------
    data : pandas DataFrame with all possible predictors and response

    response: string, name of response column in data

    Returns:
    --------
    model: an "optimal" fitted statsmodels linear model
           with an intercept
           selected by forward selection
           evaluated by adjusted R-squared
    """
    remaining = set(data.columns)
    remaining.remove(response)
    selected = []
    current_score, best_new_score = 0.0, 0.0
    while remaining and current_score == best_new_score:
        scores_with_candidates = []
        for candidate in remaining:
            formula = "{} ~ {} + 1".format(response,
                                           ' + '.join(selected + [candidate]))
            score = smf.ols(formula, data).fit().rsquared_adj
            scores_with_candidates.append((score, candidate))
        scores_with_candidates.sort()
        best_new_score, best_candidate = scores_with_candidates.pop()
        if current_score < best_new_score:
            remaining.remove(best_candidate)
            selected.append(best_candidate)
            current_score = best_new_score
    formula = "{} ~ {} + 1".format(response,
                                   ' + '.join(selected))
    model = smf.ols(formula, data).fit()
    return model

In [27]:
# model = forward_selected(new_df, 'total_cases')
# print(model.model.formula)
# print(model.rsquared_adj)
# model.summary()


# model = forward_selected(train_X, "Type")
# print(model.model.formula)
# print(model.rsquared_adj)
# model.summary()
train_X.Type

283775     Normal
172953     Poison
207540    Psychic
8640       Ground
236838     Normal
215075    Psychic
85849      Normal
76754      Normal
231024        Bug
225167      Grass
291219     Normal
210651      Water
235820    Psychic
107576      Water
271009        Bug
274955     Normal
103739        Bug
78193     Psychic
123810      Water
225617        Bug
132861        Bug
140647     Normal
184356     Normal
112858     Normal
241116    Psychic
37549       Water
134498      Grass
97985      Ground
276829        Bug
52617       Water
           ...   
80227      Normal
15311      Normal
22632      Normal
52530      Normal
282522      Water
35413       Ghost
71776      Normal
152479        Bug
94209         Bug
28774      Poison
206159        Bug
117692      Grass
1063         Fire
275305        Bug
106639     Normal
117514     Normal
90528         Bug
201631        Bug
133137      Grass
139485        Bug
45999      Normal
217718     Normal
30411      Normal
44231      Normal
35483     

In [23]:
# model = forward_selected(new_df, 'total_cases')
# print(model.model.formula)
# print(model.rsquared_adj)
# model.summary()


model = forward_selected(train_X, "Type")
print(model.model.formula)
print(model.rsquared_adj)
model.summary()

ValueError: shapes (236816,15) and (236816,15) not aligned: 15 (dim 1) != 236816 (dim 0)

# ------------------------------------ ALT FEATURE SELECTION BELOW--------------------------------------------

forward feature selection using Lasso regression:
https://mikulskibartosz.name/forward-feature-selection-in-scikit-learn-f6476e474ddd

* Uses regularization to prevent overfitting
* Sets the coefficients of unimportant variables to 0
* Don't forget to smash that subscribe button
* For every 'Like' on this jupyter notebook, an INFO student yodels "The iSchool is MySchool!"
* Delete the last three bullets before submitting

### Made a change in 'outcomes' variable

In [9]:
# # Split data
# from sklearn.model_selection import train_test_split

# train_X2, test_X2, train_Y2, test_Y2 = train_test_split(
    
#                                        DATA_encoded,      # features
#                                        DATA_encoded.Type, # outcomes
# #                                        TEST_DF,      # features
# #                                        TEST_DF.Type, # outcomes
#                                        test_size=0.20, # percentage of data to use as the test set
#                                        random_state=15 # set a random state so it is consistent (not required!)
#                                                                             )

# print("train features shape", train_X2.shape)
# print("test features shape", test_X2.shape)
# print()
# print("train outcomes shape", train_Y2.shape)
# print("test outcomes shape", test_Y2.shape)

train features shape (236816, 242)
test features shape (59205, 242)

train outcomes shape (236816,)
test outcomes shape (59205,)


In [36]:
train_X = train_X.dropna()

# train_X_results = train_X2.Type
# train_X = train_X2.drop("Type", axis=1)

# test_X_results = test_X2.Type
# test_X2 = test_X2.drop("Type", axis=1)

In [37]:
# # Create array of encoded Pokemon Types
# pokemon_unique_types = list(train_X2_results.unique())
# type_float = []
# for val in train_X2_results:
#     type_float.append(pokemon_unique_types.index(val))

In [None]:
# Code Reference: https://mikulskibartosz.name/forward-feature-selection-in-scikit-learn-f6476e474ddd
import seaborn as sns
import pandas as pd
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

X = train_X.drop(["pokemonId", "class"], axis=1) # Drop columns that directly affect Pokemon Type
y = train_X_results
estimator = Lasso()
featureSelection = SelectFromModel(estimator)
featureSelection.fit(X, y)
selectedFeatures = featureSelection.transform(X)
selectedFeatures

In [13]:
# Get features that most affect the Pokemon Type predictions
X.columns[featureSelection.get_support()]


Index(['longitude', 'sunriseMinutesMidnight', 'sunsetMinutesMidnight',
       'population_density'],
      dtype='object')

## Ending Notes/Ides:

* The line above returns columns - 'longitude', 'sunriseMinutesMidnight', 'sunsetMinutesMidnight', 'population_density'
* Drop irrelevant columns in 'X' above for a more narrow scope of features when modeling iterations (out of all the columns, which weather data is best etc.)

# --------------END OF ALT FEATURE SELECTION --------------------------------------------

### Decision Tree

In [81]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

dtree_model = DecisionTreeClassifier(max_depth = 2).fit(train_X, train_Y) 
dtree_predictions = dtree_model.predict(test_X) 
accuracy_score(test_Y, dtree_predictions)

### KNN Neighbors

In [86]:
from sklearn.neighbors import KNeighborsClassifier

knn_clf = KNeighborsClassifier(n_neighbors=3)
knn_model = knn_clf.fit(train_X, train_Y)
preds = knn_model.predict(test_X)
accuracy_score(test_Y, preds)

## Multiclass regression

### SVC Regression

Meant for working with categorical data.
Grab subset sample data for SVC regression.

In [16]:
sample_train_X = train_X_for_modeling.sample(frac=0.01)
sample_train_Y = sample_train_X.Type
sample_train_X = sample_train_X.drop("Type", axis=1)

sample_test_X = test_X_for_modeling.sample(frac=0.01)
sample_test_Y = sample_test_X.Type
sample_train_X = sample_test_X.drop("Type", axis=1)

In [21]:
unique_types = sample_train_Y.unique()

encoded_sample_train_Y = []
for val in sample_train_Y:
    encoded_sample_train_Y.append(unique_types.index(val))

AttributeError: 'numpy.ndarray' object has no attribute 'index'

In [None]:
from sklearn.svm import SVC 
svm_model_linear = SVC(C = 1).fit(sample_train_X, sample_train_Y) 
svm_predictions = svm_model_linear.predict(sample_test_X) 
accuracy = svm_model_linear.score(sample_test_X, sample_test_Y)



### Grid Search

Determine best neighbors and depth level for models

In [12]:
from sklearn.model_selection import GridSearchCV

In [17]:
from sklearn.neighbors import KNeighborsClassifier

# for KNN
param_grid = {'n_neighbors':range(1, 6), 'weights':["uniform", "distance"]}
grid_search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, return_train_score=True)
grid_search.fit(sample_train_X, sample_train_Y)
knn_params = grid_search.cv_results_['params'][grid_search.best_index_]
knn_score = grid_search.score(sample_test_X, sample_test_Y)
print("KNN params:", knn_params, "KNN score:", knn_score)



ValueError: could not convert string to float: 'Normal'

In [None]:
# for Decision Tree
param_grid_tree = {'random_state': np.arange(1, 5)}
tree_grid = GridSearchCV(DecisionTreeClassifier(), param_grid_tree, cv=5, return_train_score=True)
tree_grid.fit(train_X, train_Y)
tree_params = tree_grid.cv_results_['params'][tree_grid.best_index_]
tree_score = tree_grid.score(test_X, test_Y)
print("Tree params:", tree_params, "Tree score:", tree_score)

<br>
<br>
<br>

### Ending Thoughts

> feature selection - what to do?
> compare accuracies
> See per type - create results df, compare how accurate predictions are for <type>