# Import Data

<br>
<br>

Below are brief explanations of what data we are importing and what these variables represent
* `ALL_DATA` _(data frame)_ - is our main data set 
* `poke_types` _(data frame)_ - is a seperate data set that contains a mapping of Pokemon IDs to pokemon name and types. This will be important as our main dependant variable will be **Pokemon type**
* `pokemonId_ALL` _(list/array)_ - of all unique pokemon IDs that exist in `ALL_DATA` from smallest to biggest number. 

<br>
<br>
<br>
<br>


In [198]:
import pandas as pd # data frames
import numpy as np # ____number generation
import statsmodels.formula.api as smf # for linear modeling
import matplotlib.pyplot as plt # plotting

import os.path # check if data file already exists
import json # save dictionary strucs

In [199]:
ALL_DATA = pd.read_csv('data/300k.csv')

  interactivity=interactivity, compiler=compiler, result=result)


In [200]:
poke_types = pd.read_csv('data/pokeId.csv')
poke_types = poke_types[['#', "Name", "Type 1", "Type 2"]]

pokemonId_ALL = set(ALL_DATA.pokemonId)

<br>
<br>
<br>

## Small Helper Functions

In [201]:
def printAllFeatures():
    cols = ALL_DATA.columns
    for col in cols:
        print(col)

<br>
<br>
<br>

## Prepare Data

In [202]:
# Create a mapping of IDs to type, secondary type, and name. We need to do this in order to add the
# appropriate type to each pokemon in our main data set.

pokeId_toType = {}

if not os.path.exists("data/pokeId_toType.json"): 
    
    print("Looks like you don't have the Pokemon ID to pokemon Type+Name dictionary.\nLet me go build that for you.\nStarting... ")
    
    # build dictionary and save it for future uses
    for index, row in poke_types.iterrows():
        if row["#"] in pokemonId_ALL: #only add poke that are in main data set
            pokeId_toType[str(row["#"])] = [row["Name"] ,row["Type 1"], row["Type 2"]]
    
    print("Saving dictionary locally for future uses.")
    with open("data/pokeId_toType.json", 'w') as f:
        json.dump(pokeId_toType, f)
    print("...Done")
else:
    print("Good, you already have the Poke ID to Poke Type+Name dict. Move on.")    
    with open("data/pokeId_toType.json") as f:
        pokeId_toType = json.load(f)

Good, you already have the Poke ID to Poke Type+Name dict. Move on.


In [203]:
# Check if you have the merged data set. If you do not then build and save it locally.
# The print statements do a good job of informing what's going on.

if not os.path.exists("data/merged.csv"): 
    
    print("Hey looks like you're missing the merged data, let me build that for you, it'll take 2-5 minutes probably.\n")
    
    # Add the new columns
    ALL_DATA["Name"] = ""
    ALL_DATA["Type"] = "" 

    def update_row(row):
        tempTypes = pokeId_toType[str(row["pokemonId"])]   
        listy = [tempTypes[0], tempTypes[1]]
        return pd.Series(listy)

    print("Merging data...")
    ALL_DATA[['Name', 'Type']] = ALL_DATA.apply(update_row, axis=1)
    print("Done merging data.\n")

    print("Saving merge data set to ./data/merged.csv")
    ALL_DATA.to_csv("data/merged.csv",index=False)
    print("\n...Done saving data. Move on now.\n")
    ALL_DATA.head()
else:
    ALL_DATA = pd.read_csv('data/merged.csv')
    print("Good job, you already have the merged data. Move on.")

Good job, you already have the merged data. Move on.


### Encoding

We cannot use categorical data (strings) in feature selection or regressions.
The solution to this is to turn strings into numerical data mappings in order to be able to process them.
Based on my research there are three beginner friendly functions that do this. 

`sklearn` has the functions `LabelEncoder()` and `OneHotEncoder()` -- and `pandas` has its own functions `get_dummies()`.

I selected the `get_dummies()` as both it and my data frames are managed with the `pandas` library.

In [205]:
DATA_encoded = ALL_DATA.dropna()

# These columns are of type objects but are not helpful to us.
DATA_encoded = DATA_encoded.drop(["appearedLocalTime", "appearedDayOfWeek", "_id", "city", "weatherIcon", "Name"], axis=1) ########## Question from Maggie - Do we want to keep name, would it help our results or nah? 

# it's an object column but should be float: convert it.
DATA_encoded["pokestopDistanceKm"] = DATA_encoded["pokestopDistanceKm"].apply(pd.to_numeric, downcast='float', errors='coerce')

In [206]:
# anything of type object seems to be categorical, encode it.

# produce dummies for categorical columns, concat them to data, remove originals

# get_dummies() code:
# https://towardsdatascience.com/the-dummys-guide-to-creating-dummy-variables-f21faddb1d40

cols_toNotTouch = ["Type"]

for col in DATA_encoded.columns:
    if (str(DATA_encoded[col].dtype) == 'object') and (not(col in cols_toNotTouch)):
        
        print("Encoding <" + col + ">.")
        dummy = pd.get_dummies(DATA_encoded[col])
        DATA_encoded = DATA_encoded.drop([col], axis=1)
        DATA_encoded = pd.concat([DATA_encoded, dummy], axis=1)
print("...Done encoding.")

Encoding <appearedTimeOfDay>.
Encoding <continent>.
Encoding <weather>.
...Done encoding.


In [155]:
# Split data
from sklearn.model_selection import train_test_split

train_X, test_X, train_Y, test_Y = train_test_split(
                                       DATA_encoded,      # features
                                       ALL_DATA["Type"],    # outcome ######################### CHANGE TO ---> DATA_encoded.Type 
                                       test_size=0.20, # percentage of data to use as the test set
                                       random_state=15 # set a random state so it is consistent (not required!)
                                                                            )

print("train features shape", train_X.shape)
print("test features shape", test_X.shape)
print()
print("train outcomes shape", train_Y.shape)
print("test outcomes shape", test_Y.shape)

train features shape (236816, 242)
test features shape (59205, 242)

train outcomes shape (236816,)
test outcomes shape (59205,)


In [182]:
train_X = train_X.drop(["Type"], axis=1)
test_X = test_X.drop(["Type"], axis=1)

<br>
<br>
<br>

## Feature Selection

https://stackoverflow.com/questions/30384995/randomforestclassfier-fit-valueerror-could-not-convert-string-to-float

In [59]:
import statsmodels.formula.api as smf

# Code from https://planspace.org/20150423-forward_selection_with_statsmodels/
def forward_selected(data, response):
    """Linear model designed by forward selection.

    Parameters:
    -----------
    data : pandas DataFrame with all possible predictors and response

    response: string, name of response column in data

    Returns:
    --------
    model: an "optimal" fitted statsmodels linear model
           with an intercept
           selected by forward selection
           evaluated by adjusted R-squared
    """
    remaining = set(data.columns)
    remaining.remove(response)
    selected = []
    current_score, best_new_score = 0.0, 0.0
    while remaining and current_score == best_new_score:
        scores_with_candidates = []
        for candidate in remaining:
            formula = "{} ~ {} + 1".format(response,
                                           ' + '.join(selected + [candidate]))
            score = smf.ols(formula, data).fit().rsquared_adj
            scores_with_candidates.append((score, candidate))
        scores_with_candidates.sort()
        best_new_score, best_candidate = scores_with_candidates.pop()
        if current_score < best_new_score:
            remaining.remove(best_candidate)
            selected.append(best_candidate)
            current_score = best_new_score
    formula = "{} ~ {} + 1".format(response,
                                   ' + '.join(selected))
    model = smf.ols(formula, data).fit()
    return model

In [90]:
# model = forward_selected(new_df, 'total_cases')
# print(model.model.formula)
# print(model.rsquared_adj)
# model.summary()


model = forward_selected(train_X, "Type")
print(model.model.formula)
print(model.rsquared_adj)
model.summary()

KeyError: 'Type'

# ------------------------------------ ALT FEATURE SELECTION BELOW--------------------------------------------

forward feature selection using Lasso regression:
https://mikulskibartosz.name/forward-feature-selection-in-scikit-learn-f6476e474ddd

* Uses regularization to prevent overfitting
* Sets the coefficients of unimportant variables to 0
* Don't forget to smash that subscribe button
* For every 'Like' on this jupyter notebook, an INFO student yodels "The iSchool is MySchool!"
* Delete the last three bullets before submitting

### Made a change in 'outcomes' variable

In [251]:
# Split data
from sklearn.model_selection import train_test_split

train_X2, test_X2, train_Y2, test_Y2 = train_test_split(
    
                                       DATA_encoded,      # features
                                       DATA_encoded.Type, # outcomes
#                                        TEST_DF,      # features
#                                        TEST_DF.Type, # outcomes
                                       test_size=0.20, # percentage of data to use as the test set
                                       random_state=15 # set a random state so it is consistent (not required!)
                                                                            )

print("train features shape", train_X2.shape)
print("test features shape", test_X2.shape)
print()
print("train outcomes shape", train_Y2.shape)
print("test outcomes shape", test_Y2.shape)

train features shape (236816, 242)
test features shape (59205, 242)

train outcomes shape (236816,)
test outcomes shape (59205,)


In [208]:
train_X2 = train_X2.dropna()

train_X2_results = train_X2.Type
train_X2 = train_X2.drop("Type", axis=1)

test_X2_results = test_X2.Type
test_X2 = test_X2.drop("Type", axis=1)

In [209]:
# Create array of encoded Pokemon Types
pokemon_unique_types = list(train_X2_results.unique())
type_float = []
for val in train_X2_results:
    type_float.append(pokemon_unique_types.index(val))

In [210]:
# Code Reference: https://mikulskibartosz.name/forward-feature-selection-in-scikit-learn-f6476e474ddd
import seaborn as sns
import pandas as pd
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

X = train_X2.drop(["pokemonId", "class"], axis=1) # Drop columns that directly affect Pokemon Type
y = type_float
estimator = Lasso()
featureSelection = SelectFromModel(estimator)
featureSelection.fit(X, y)
selectedFeatures = featureSelection.transform(X)
selectedFeatures



array([[-93.294797, 398, 1186, 266.19315],
       [-115.074725, 375, 1144, 1361.0005],
       [-1.897839, 385, 1190, 3622.8896],
       ...,
       [-92.652612, 394, 1167, 21.806335],
       [-113.44095, 408, 1219, 761.8856],
       [14.472905, 387, 1178, 1055.8982]], dtype=object)

In [211]:
# Get features that most affect the Pokemon Type predictions
X.columns[featureSelection.get_support()]


Index(['longitude', 'sunriseMinutesMidnight', 'sunsetMinutesMidnight',
       'population_density'],
      dtype='object')

## Ending Notes/Ides:

* The line above returns columns - 'longitude', 'sunriseMinutesMidnight', 'sunsetMinutesMidnight', 'population_density'
* Drop irrelevant columns in 'X' above for a more narrow scope of features when modeling iterations (out of all the columns, which weather data is best etc.)

# --------------END OF ALT FEATURE SELECTION --------------------------------------------

In [61]:
from sklearn.linear_model import Lasso

# lassoReg = Lasso(alpha=1e-4, normalize=True)
# lassoReg.fit(train_X, train_Y)
# pred = lassoReg.predict(test_X)

In [62]:
# Code reference: https://www.analyticsvidhya.com/blog/2016/01/complete-tutorial-ridge-lasso-regression-python/#four

def getBestAlpha(dataFrame, outcomeFeatureName, alpha_values, train_X, train_Y ,test_):
    """
    
    Returns a pandas data frame row with best rss score and alpha.
    """
    
    column_names= ['alpha_lasso', 'rss'] + (list(dataFrame.drop([outcomeFeatureName], axis=1).columns))
    
    alpha_df = pd.DataFrame(columns=column_names)
    
    for alpha_val in alpha_values:
        lassoReg = Lasso(alpha=alpha_val, normalize=True)
        lassoReg.fit(train_X, train_Y)
        pred = lassoReg.predict(test_X)

        rss = sum((pred-test_Y)**2)

        new_row = [alpha_val, rss] + (list(lassoReg.coef_))
        alpha_df.loc[len(alpha_df)] = new_row 


    return(alpha_df[alpha_df.rss == alpha_df.rss.min()])
    
    

In [76]:
lassoReg = Lasso(alpha=1, normalize=True)
lassoReg.fit(train_X, train_Y)
# pred = lassoReg.predict(test_X)

# rss = sum((pred-test_Y)**2)

# new_row = [alpha_val, rss] + (list(lassoReg.coef_))
# alpha_df.loc[len(alpha_df)] = new_row 

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

In [66]:
train_X.loc[0].Type

'Normal'

In [45]:
print("\n\n--- Getting Best Alpha Value ---")
alpha_values = [1e-15, 1e-10, 1e-8, 1e-5,1e-4, 1e-3,1e-2,1, 5, 10]
row = getBestAlpha(ALL_DATA, "Type", alpha_values, train_X, train_Y ,test_X)
print("* First best alpha value from a broad array of numbers between 1e-15 and 10.0: " , (row.alpha_lasso).iloc[0])

row



--- Getting Best Alpha Value ---


ValueError: could not convert string to float: '2016-09-03T03:22:31'

<br>
<br>
<br>

## Correlations
Visualize the correlations of features to our dependant variable: pokemon type.

In [7]:
printAllFeatures()

pokemonId
latitude
longitude
appearedLocalTime
_id
cellId_90m
cellId_180m
cellId_370m
cellId_730m
cellId_1460m
cellId_2920m
cellId_5850m
appearedTimeOfDay
appearedHour
appearedMinute
appearedDayOfWeek
appearedDay
appearedMonth
appearedYear
terrainType
closeToWater
city
continent
weather
temperature
windSpeed
windBearing
pressure
weatherIcon
sunriseMinutesMidnight
sunriseHour
sunriseMinute
sunriseMinutesSince
sunsetMinutesMidnight
sunsetHour
sunsetMinute
sunsetMinutesBefore
population_density
urban
suburban
midurban
rural
gymDistanceKm
gymIn100m
gymIn250m
gymIn500m
gymIn1000m
gymIn2500m
gymIn5000m
pokestopDistanceKm
pokestopIn100m
pokestopIn250m
pokestopIn500m
pokestopIn1000m
pokestopIn2500m
pokestopIn5000m
cooc_1
cooc_2
cooc_3
cooc_4
cooc_5
cooc_6
cooc_7
cooc_8
cooc_9
cooc_10
cooc_11
cooc_12
cooc_13
cooc_14
cooc_15
cooc_16
cooc_17
cooc_18
cooc_19
cooc_20
cooc_21
cooc_22
cooc_23
cooc_24
cooc_25
cooc_26
cooc_27
cooc_28
cooc_29
cooc_30
cooc_31
cooc_32
cooc_33
cooc_34
cooc_35
cooc_36
cooc_

In [8]:
from matplotlib.pyplot import figure

correlation = ALL_DATA.corr() # .drop(['Type'], axis=1)

In [9]:
corr_sorted = correlation.sort_values(by=["Name"])
features = list(correlation.columns.values)
corr_nums = list(correlation.total_cases)

figure(figsize=(10,10))
plt.barh(features, corr_nums, align='center', alpha=0.5)
plt.yticks(features, features)
plt.xlabel('Correlation')
plt.title('Correlation of environmental variables to total cases of Dengue')

plt.show()

KeyError: 'Name'

<br>
<br>
<br>

### Pokemon Types based on weather features

In [51]:
col_pokeID = ["pokemonId"]
cols_time = ["appearedTimeOfDay", "appearedHour", "appearedMinute", "appearedDayOfWeek", "appearedDay", "appearedYear"]
cols_weather = ["weather", "windSpeed", "windBearing", "pressure", "weatherIcon", "sunriseMinutesMidnight", "sunriseHour", "sunriseMinute", "sunriseMinutesSince", "sunsetMinutesMidnight", "sunsetHour", "sunsetMinute", "sunsetMinutesBefore"]

cols_weather_nums =  ["windSpeed", "windBearing", "pressure", "sunriseMinutesMidnight", "sunriseHour", "sunriseMinute", "sunriseMinutesSince", "sunsetMinutesMidnight", "sunsetHour", "sunsetMinute", "sunsetMinutesBefore"]

cols = col_pokeID + cols_weather
geo_weather_data = ALL_DATA[cols]
geo_weather_data

Unnamed: 0,pokemonId,weather,windSpeed,windBearing,pressure,weatherIcon,sunriseMinutesMidnight,sunriseHour,sunriseMinute,sunriseMinutesSince,sunsetMinutesMidnight,sunsetHour,sunsetMinute,sunsetMinutesBefore
0,16,Foggy,4.79,269,1018.02,fog,436,7,16,941,1181,19,41,-196
1,133,Foggy,4.79,269,1018.02,fog,436,7,16,941,1181,19,41,-196
2,16,Clear,4.29,218,1015.29,clear-night,404,6,44,1033,1171,19,31,-266
3,13,PartlyCloudy,5.84,160,1020.52,partly-cloudy-night,398,6,38,858,1179,19,39,-77
4,133,PartlyCloudy,5.84,160,1020.52,partly-cloudy-night,398,6,38,858,1179,19,39,-77
5,21,PartlyCloudy,6.39,218,1024.44,partly-cloudy-day,385,6,25,330,1085,18,5,370
6,66,PartlyCloudy,6.40,218,1024.45,partly-cloudy-day,385,6,25,330,1085,18,5,370
7,27,Clear,11.26,142,1016.69,clear-night,436,7,16,939,1187,19,47,-188
8,35,Foggy,4.79,269,1018.02,fog,436,7,16,941,1181,19,41,-196
9,19,Clear,3.94,253,1020.12,clear-night,437,7,17,997,1195,19,55,-239


# ---------------------------------------------------------------------------------------

### Sample data into smaller portions for modeling (otherwise memory errors)

In [262]:
# Randomly samples X% of your dataframe
sample = ALL_DATA.sample(frac=0.01)
sample.shape

(2960, 210)

### Split again based on new sample size

In [265]:
sample_train_X, sample_test_X, sample_train_Y, sample_test_Y = train_test_split(
                                       sample,      # features
                                       sample["Type"],    # outcome ######################### CHANGE TO ---> DATA_encoded.Type 
                                       test_size=0.20, # percentage of data to use as the test set
                                       random_state=15 # set a random state so it is consistent (not required!)
                                                                            )

print("train features shape", sample_train_X.shape)
print("test features shape", sample_test_X.shape)
print()
print("train outcomes shape", sample_train_Y.shape)
print("test outcomes shape", sample_test_Y.shape)

train features shape (2368, 210)
test features shape (592, 210)

train outcomes shape (2368,)
test outcomes shape (592,)


In [266]:
sample_train_X_results = sample_train_X.Type
sample_train_X = sample_train_X.drop("Type", axis=1)

sample_test_X_results = sample_test_X.Type
sample_test_X = sample_test_X.drop("Type", axis=1)

### Pokemon Type based on Location and Population Density

In [235]:
def buildMultiString(arr, dependantName):
    """
    Takes in an array of features and the name of your dependant
    variable and returns a properly formatted forumal for
    multivariate regression.
    """
    result = dependantName + " ~ " + arr[0]
    for feature in arr[1:]:
        result += " + " + feature
    return(result)

In [267]:
# Get relevant data
location_population_cols = ['latitude', 'longitude', 'population_density', 'urban', 'suburban', 'midurban', 'rural', 'gymDistanceKm']
location_population_data = sample_train_X[location_population_cols]
location_population_data["Type"] = sample_test_X_results

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  after removing the cwd from sys.path.


#### Multivariable regression - (having issues right now, will revisit)

In [272]:
equation = buildMultiString(location_population_cols, "Type")

multModel = smf.ols(equation, data=location_population_data).fit()


# Prune test data to match encoded
# https://stackoverflow.com/questions/41271725/getting-valueerror-shapes-not-aligned-on-scikit-linear-regression
# test_encoded = pd.get_dummies(location_population_data, columns=location_population_cols)
# test_encoded_for_model = test_encoded.reindex(columns = sample_train_X.columns, fill_value=0)


multivariate_preds = multModel.predict(sample_test_X)
multModel.summary()

PatsyError: model is missing required outcome variables

### KNN Neighbors

In [273]:
# Needs more data pruning when modeling -----------------------------------

from sklearn.neighbors import KNeighborsClassifier

knn_clf = KNeighborsClassifier(n_neighbors=3)
knn_model = knn_clf.fit(sample_train_X, sample_train_Y)
preds = knn_model.predict(sample_test_X)

ValueError: could not convert string to float: '2016-09-06T16:11:00'