In [24]:
import pandas as pd
import numpy as np
import math
import scipy.stats as ss
import thinkplot
import thinkstats2
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from statsmodels.stats.outliers_influence import variance_inflation_factor as vif

##Seaborn for fancy plots. 
import matplotlib.pyplot as plt
import seaborn as sns
plt.rcParams["figure.figsize"] = (15,5)

# Assignment 3 - Basic Predictions and Regression

## Questions

### Part 1 - Election Prediction

Suppose you are looking at an election in a fictional province. There are 7 total electoral districts, and the winner in each district is determined by a first-past-the-post system (what we have in Canada - the most votes wins, regardless of share). There are two parties - the Purples and the Yellows. Whoever controls the most seats will be the ruling party - so in our 2 party scenario, the party that wins 4 or more of the districts will govern. There is an election every year; they love voting.

Recent polling indicating the expected vote share in each district is shown in the "dist_polls" table below. These values are a composite of several polls that the experts have combined and weighted. The "Purple" values show the expected vote share of the Purple party, along with the variance of that expectation and the number of polls that were combined to get that result.  

As well, research has shown that the vote distribution is impacted by voter turnout. In general, the more people vote, the more the vote split shifts towards the Yellow party. We have data on past elections and their results, and we expect turnout to be in line with those past elections - or more specifically, we have no reason to expect it to differ. This impact is measured in the table in the code below - that table shows the voter turnout, as a ratio, as well as the change in the Yellow party's vote share (also as a ratio) compared to the polling averages. For example, if one row showed ".52" and ".008", that would mean that voter turnout was 52%, and the Yellow party's vote share was 0.8 percentage points higher than the polling showed. Assume 60% of people will turn out for this election.

<b>What is the probability that the Purple Party controls the government after the election?</b>

<b>Note:</b> the errors and confidence intervals are not totally trivial. As part of the written answer, offer an evaluation of your confidence in the prediction, and why you think that. This is not a question with one specific error, your estimation will have some expected errors, somewhere. You may not have the tools to calculate it all the way through, that's fine. 

### Question 1 - Your Answer in English

Please fill in (and extend if required) the list here to explain what you did. There are multiple reasonable things you could do to approach this, so please note what you did here. For most people I assume this will be about 3-5 statements - you don't need to explain the internals of things we covered (e.g. if there's a hypothesis test, you don't need to explain how that works), just how you structured your approach to the problem. 

<ul>
<li>Averaged the nine polls for each district
<li>Created a linear regression to estimate the effect voter turnout has on the Yellow party's vote share
<li>Adjusted the polling averages using the regression's prediction at 60% turnout
<li>Ran simulations of 100 runs of 1000 votes each per district, taking the mean of each run to estimate each district's chance of being won by Purple
<li>Based on my simulations, Purple is favoured only in District 4, so it will not win the 4 seats needed to govern.
<li> <b> What do you think about the error/accuracy</b>: The errors are very low for each district (roughly 0.013 to 0.017). The accuracy seems mostly alright, given that the Purple vote probability hovers around 35-40%, but the fact that I keep getting all-or-nothing values for each district makes it a little suspect. The R squared for the linear regression of voter turnout against its effect on the average votes is really low (negative on the test set), but given that we don't have a lot of data, all we can do is use most of the data for training.
</ul>
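
The all-or-nothing district results can be checked against the binomial standard error. The sketch below is only a rough check, not part of the original work: `analytic_win_chance` is a hypothetical helper, the normal approximation is an assumption, and 0.47445 is District 1's adjusted share from the table further down.

```python
import math
from scipy import stats

def analytic_win_chance(purple_share, voters=1000):
    """Normal approximation to the simulated vote share (hypothetical helper)."""
    # Standard error of a proportion estimated from `voters` independent votes.
    se = math.sqrt(purple_share * (1 - purple_share) / voters)
    # Chance that the simulated share lands above 50%.
    win_chance = stats.norm.sf(0.5, loc=purple_share, scale=se)
    return se, win_chance

se, win_chance = analytic_win_chance(0.47445)
print(se, win_chance)
```

With 1000 simulated voters the standard error is only about 0.016, so any share more than a few points away from 0.5 gives a win chance very close to 0 or 1. The simulation only captures sampling noise among voters, not the spread between polls or the uncertainty in the turnout regression.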

##### Setup Poll Data

The dataframe "dist_polls" contains all of the polls for each seat. Each value is expressed as expected vote share (as a ratio) for the <b>Purple</b> party. The Yellow party can be safely assumed to get the rest of the votes. 

In [25]:
# Please don't edit this part. 
# Setup polling data. 
districts = [1,2,3,4,5,6,7]
dist_polls = pd.DataFrame(districts, columns=["district"])

dist_polls["Poll_1"] = [.55, .49, .51, .6, .41, .46, .54]
dist_polls["Poll_2"] = [.53, .51, .51, .62, .44, .48, .53]
dist_polls["Poll_3"] = [.51, .49, .53, .61, .42, .46, .52]
dist_polls["Poll_4"] = [.47, .48, .51, .54, .45, .45, .51]
dist_polls["Poll_5"] = [.61, .52, .49, .73, .44, .51, .53]
dist_polls["Poll_6"] = [.54, .45, .51, .61, .47, .52, .52]
dist_polls["Poll_7"] = [.55, .47, .5, .56, .47, .46, .56]
dist_polls["Poll_8"] = [.53, .49, .51, .55, .43, .49, .55]
dist_polls["Poll_9"] = [.57, .39, .52, .57, .53, .43, .53]


dist_polls.head()

Unnamed: 0,district,Poll_1,Poll_2,Poll_3,Poll_4,Poll_5,Poll_6,Poll_7,Poll_8,Poll_9
0,1,0.55,0.53,0.51,0.47,0.61,0.54,0.55,0.53,0.57
1,2,0.49,0.51,0.49,0.48,0.52,0.45,0.47,0.49,0.39
2,3,0.51,0.51,0.53,0.51,0.49,0.51,0.5,0.51,0.52
3,4,0.6,0.62,0.61,0.54,0.73,0.61,0.56,0.55,0.57
4,5,0.41,0.44,0.42,0.45,0.44,0.47,0.47,0.43,0.53


##### Setup Turnout Data

The dataframe "past_vote_table" shows the voter turnout, along with the impact on the votes counted for the <b>Yellow party</b>, both expressed as ratios. For example, if in one row the turnout is .45 and the Yellow_improvement is -.04, that means that 45% of the populace turned out to vote, and the Yellow party's vote share was 4 percentage points lower than polling indicated.

In [26]:
# Please don't edit this part. 
# Setup vote data. 
voter_turnout_history = [.53, .51, .48, .55, .54, .59, .49, .57, .56]
past_vote_table = pd.DataFrame(voter_turnout_history, columns=["voter_turn_percentage"])
past_vote_table["Yellow_improvement"] = [.012, .023, -.017, .031, .030, -.004, -.03, .042, .029]
past_vote_table["year"] = [2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021]
past_vote_table.head()

Unnamed: 0,voter_turn_percentage,Yellow_improvement,year
0,0.53,0.012,2013
1,0.51,0.023,2014
2,0.48,-0.017,2015
3,0.55,0.031,2016
4,0.54,0.03,2017


## Start Your Work

### Part 1 - Election

In [27]:
# Generate one vote (1 = Purple, 0 = Yellow), with the probability of a Purple vote supplied as an argument.
def oneVote(probCalc):
    vote = np.random.binomial(n=1, p=probCalc)
    return vote

# Generate n votes for one simulated election in a district.
def getSample(voteProb, n=1000):
    vote_list = []
    for i in range(n):
        vote_list.append(oneVote(voteProb))
    return vote_list

# Run several simulated elections and record the Purple vote share of each.
def getSamples(voteProb, n=1000, samples=100, ciLow=2.5, ciHi=97.5):
    meanList = []
    for i in range(samples):
        meanList.append(np.mean(getSample(voteProb, n)))
    muList = [voteProb] * samples
    cdf = thinkstats2.Cdf(meanList) # CDF of the simulated vote shares
    ci = cdf.Percentile(ciLow), cdf.Percentile(ciHi) # 2.5th and 97.5th percentiles by default
    # RMSE of the simulated shares around the input probability (the standard error).
    # np.sqrt is used because squared=False is deprecated in recent scikit-learn releases.
    stderr = np.sqrt(mean_squared_error(muList, meanList))
    return meanList, stderr, cdf, ci

# Estimate Purple's chance of winning a district as the fraction of simulated elections with a share above 50%.
def simulate(probValue, n=1000, samples=100):
    means, err, cdfFin, ciFin = getSamples(probValue, n=n, samples=samples, ciLow=2.5, ciHi=97.5)
    pWins = 0
    for shareValue in means:
        if shareValue > .5000:
            pWins = pWins + 1
    return (pWins/samples), err, cdfFin, ciFin
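
The vote-by-vote loops above can be replaced by a single binomial draw per simulated election. This is a minimal sketch of that idea, not the notebook's method: `simulate_district` and its parameter names are hypothetical, and it uses NumPy's `default_rng` generator instead of `thinkstats2`.

```python
import numpy as np

def simulate_district(purple_share, voters=1000, elections=100, seed=None):
    """Vectorized district simulation (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # One binomial draw gives the number of Purple votes in each simulated election.
    purple_votes = rng.binomial(n=voters, p=purple_share, size=elections)
    shares = purple_votes / voters
    # Fraction of simulated elections in which Purple gets more than half the votes.
    win_chance = float(np.mean(shares > 0.5))
    # Central 95% interval of the simulated vote shares.
    ci = tuple(np.percentile(shares, [2.5, 97.5]))
    return win_chance, ci

win_chance, ci = simulate_district(0.6, seed=0)
print(win_chance, ci)
```

Because the draw is vectorized, the number of simulated elections can be raised well beyond 100 at little cost, which gives finer-grained win chances than multiples of 1%.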

In [62]:
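# Regress the Yellow improvement on turnout, holding out 20% of the past elections for testing.
y = np.array(past_vote_table['Yellow_improvement']).reshape(-1,1)
x = np.array(past_vote_table['voter_turn_percentage']).reshape(-1,1)
xTrain, xTest, yTrain, yTest = train_test_split(x,y,test_size=0.2,random_state=42)

voterModel = LinearRegression().fit(xTrain,yTrain)
# Predict the Yellow improvement at the assumed 60% turnout.
yellowEffect = voterModel.predict(np.array(0.60).reshape(1,-1))
# R squared on the held-out elections; with only 9 rows the test set has just 2 points.
rSq = voterModel.score(xTest, yTest)
print(yellowEffect[0][0], rSq)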


0.032054140127388525 -4.928570902788179
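
A negative R squared on a two-point test set says little about the model. One alternative, sketched here under the assumption that it is acceptable to fit on all nine elections, is to use `scipy.stats.linregress` for the slope and its standard error and leave-one-out cross-validation for an out-of-sample error. The variable names are illustrative and the data is copied from `past_vote_table`.

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_predict

# Turnout and Yellow improvement copied from past_vote_table.
turnout = np.array([.53, .51, .48, .55, .54, .59, .49, .57, .56])
yellow_gain = np.array([.012, .023, -.017, .031, .030, -.004, -.03, .042, .029])

# Fit on all nine elections; the result carries the slope's standard error and p-value.
fit = stats.linregress(turnout, yellow_gain)
predicted_gain_at_60 = fit.intercept + fit.slope * 0.60

# Leave-one-out: each election is predicted by a model trained on the other eight.
loo_predictions = cross_val_predict(
    LinearRegression(), turnout.reshape(-1, 1), yellow_gain, cv=LeaveOneOut()
)
loo_rmse = np.sqrt(np.mean((loo_predictions - yellow_gain) ** 2))

print(fit.slope, fit.stderr, fit.rvalue ** 2, predicted_gain_at_60, loo_rmse)
```

Note that 60% turnout is above every turnout in the history (the maximum is 59%), so the prediction is a slight extrapolation either way.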


In [135]:
# Average the nine polls for each district (Purple's expected vote share).
pollColumns = [col for col in dist_polls.columns if col.startswith('Poll_')]
dist_polls['Average Expected Vote'] = dist_polls[pollColumns].mean(axis=1)
# Subtract the predicted Yellow improvement, since a gain for Yellow is a loss for Purple.
dist_polls['Average Expected Vote'] = dist_polls['Average Expected Vote'] - yellowEffect[0][0]
dist_polls.head(7)

Unnamed: 0,district,Poll_1,Poll_2,Poll_3,Poll_4,Poll_5,Poll_6,Poll_7,Poll_8,Poll_9,Average Expected Vote
0,1,0.55,0.53,0.51,0.47,0.61,0.54,0.55,0.53,0.57,0.47445
1,2,0.49,0.51,0.49,0.48,0.52,0.45,0.47,0.49,0.39,0.411117
2,3,0.51,0.51,0.53,0.51,0.49,0.51,0.5,0.51,0.52,0.44445
3,4,0.6,0.62,0.61,0.54,0.73,0.61,0.56,0.55,0.57,0.533339
4,5,0.41,0.44,0.42,0.45,0.44,0.47,0.47,0.43,0.53,0.385561
5,6,0.46,0.48,0.46,0.45,0.51,0.52,0.46,0.49,0.43,0.407784
6,7,0.54,0.53,0.52,0.51,0.53,0.52,0.56,0.55,0.53,0.466673
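
The averages above drop the spread between polls, which is a second source of uncertainty. The sketch below is a rough approximation only: it treats the nine polls as independent draws, uses a t distribution on the poll mean, and takes `yellow_shift = 0.032` as an assumed value rounded from the regression output printed earlier. All variable names are illustrative.

```python
import numpy as np
from scipy import stats

# Rows are Poll_1..Poll_9, columns are districts 1..7 (values copied from dist_polls).
polls = np.array([
    [.55, .49, .51, .60, .41, .46, .54],
    [.53, .51, .51, .62, .44, .48, .53],
    [.51, .49, .53, .61, .42, .46, .52],
    [.47, .48, .51, .54, .45, .45, .51],
    [.61, .52, .49, .73, .44, .51, .53],
    [.54, .45, .51, .61, .47, .52, .52],
    [.55, .47, .50, .56, .47, .46, .56],
    [.53, .49, .51, .55, .43, .49, .55],
    [.57, .39, .52, .57, .53, .43, .53],
])
yellow_shift = 0.032  # assumption: predicted Yellow improvement at 60% turnout

# Adjusted Purple share and the standard error of the poll average, per district.
poll_mean = polls.mean(axis=0) - yellow_shift
poll_se = polls.std(axis=0, ddof=1) / np.sqrt(polls.shape[0])

# Approximate chance that Purple's true share is above 50% in each district.
purple_lead_chance = stats.t.cdf((poll_mean - 0.5) / poll_se, df=polls.shape[0] - 1)
print(np.round(purple_lead_chance, 3))
```

Districts whose polls disagree more (for example District 4, with one poll at 0.73) get wider uncertainty than the voter-level simulation alone suggests.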


In [144]:
# Simulate each district and collect Purple's win chance and the standard error.
chancesPerDistrict = []
errors = []
for expectedShare in dist_polls['Average Expected Vote']:
    districtSim = simulate(expectedShare)
    chancesPerDistrict.append(districtSim[0])
    errors.append(districtSim[1])


In [146]:
# Build one "a X% chance to win District N" phrase per district, then join them into a sentence.
districtPhrases = ["a "+str(chance*100)+"% chance to win District "+str(i+1) for i, chance in enumerate(chancesPerDistrict)]
print("The Purple party has "+", ".join(districtPhrases[:-1])+", and "+districtPhrases[-1]+" with a confidence interval of 2.5 to 97.5 \n")

print(errors)

The Purple party has a 4.0% chance to win District 1, a 0.0% chance to win District 2, a 0.0% chance to win District 3, a 98.0% chance to win District 4, a 0.0% chance to win District 5, a 0.0% chance to win District 6, and a 0.0% chance to win District 7 with a confidence interval of 2.5 to 97.5 

[0.016770207673989887, 0.013447594715015592, 0.01617068827053432, 0.01748363928264515, 0.01465553989825994, 0.01644215830421805, 0.014126933759525126]
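
The question asks for the chance that Purple governs, which means combining the per-district chances into the probability of winning at least 4 of 7 seats. A minimal sketch, assuming the districts are independent (a strong assumption, since a turnout or polling error would likely hit all districts at once); `prob_at_least` is a hypothetical helper and the chances are the ones printed above.

```python
import numpy as np

def prob_at_least(win_probs, seats_needed):
    """P(at least seats_needed wins) for independent districts (hypothetical helper)."""
    seat_dist = np.array([1.0])  # seat_dist[k] = P(exactly k wins so far)
    for p in win_probs:
        # Each district either adds a seat (probability p) or does not (1 - p).
        seat_dist = np.convolve(seat_dist, [1.0 - p, p])
    return float(seat_dist[seats_needed:].sum())

# Per-district Purple win chances from the simulation output above.
district_chances = [0.04, 0.0, 0.0, 0.98, 0.0, 0.0, 0.0]
purple_governs = prob_at_least(district_chances, seats_needed=4)
print(purple_governs)
```

With five districts at a 0% simulated chance, Purple cannot reach 4 seats, so the result here is 0. That figure inherits the overconfidence of the per-district simulation.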


### Part 2 - Regression

<b>Use the data provided to try to predict the wage. </b>

The data is from FIFA rankings for players. You don't need to know anything about soccer or video games for this, so if these values are meaningless to you, just treat them as numbers and you'll be fine. All of the features are ratings that evaluate how good different soccer players are at different skills.

#### Answer in English

Please fill in (and extend if required) the list here to explain what you did. There are multiple reasonable things you could do to approach this, so please note what you did here. For most people I assume this will be about 3-5 statements - you don't need to explain the internals of things we covered (e.g. if there's a hypothesis test, you don't need to explain how that works), just how you structured your approach to the problem. 

<ul>
<li>Filtered out outlier wages from the data, keeping only wages above 0 and below 10,000 EUR
<li>Created a linear regression trained on that data, using a 67/33 train-test split
<li>Tested the model and found an RMSE of 1845 and an R squared of 0.25.
<li>I'm unsure how to increase the R squared of the model, as removing columns didn't seem to increase it at all.
</ul>
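
One option that might raise the R squared is to model the logarithm of the wage, since wages are usually right-skewed. The sketch below uses scikit-learn's `TransformedTargetRegressor` on synthetic data, because the real CSV is not reproduced here. The generated `skills` and `wages` arrays are invented purely for illustration and say nothing about the FIFA data.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): three skill ratings and a skewed wage.
rng = np.random.default_rng(0)
skills = rng.uniform(1, 99, size=(2000, 3))
wages = np.exp(6.0 + 0.03 * skills[:, 0] + rng.normal(0, 0.3, size=2000))

x_train, x_test, y_train, y_test = train_test_split(
    skills, wages, test_size=0.33, random_state=0
)

# Plain linear regression on the raw wage.
plain_model = LinearRegression().fit(x_train, y_train)

# Same regression, but fitted on log1p(wage) and mapped back with expm1 when predicting.
log_model = TransformedTargetRegressor(
    regressor=LinearRegression(), func=np.log1p, inverse_func=np.expm1
).fit(x_train, y_train)

plain_r2 = plain_model.score(x_test, y_test)
log_r2 = log_model.score(x_test, y_test)
print(plain_r2, log_r2)
```

Both scores are reported on the original wage scale, so they can be compared directly. Whether the transform helps on the real data would need to be checked on `players_20_2.csv`.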

In [67]:
df = pd.read_csv("players_20_2.csv")
df.head()

Unnamed: 0,wage_eur,attacking_crossing,attacking_finishing,attacking_heading_accuracy,attacking_short_passing,attacking_volleys,skill_dribbling,skill_curve,skill_fk_accuracy,skill_long_passing,...,power_long_shots,mentality_aggression,mentality_interceptions,mentality_positioning,mentality_vision,mentality_penalties,mentality_composure,defending_marking,defending_standing_tackle,defending_sliding_tackle
0,565000,88,95,70,92,88,97,93,94,92,...,94,48,40,94,94,75,96,33,37,26
1,405000,84,94,89,83,87,89,81,76,77,...,93,63,29,95,82,85,95,28,32,24
2,290000,87,87,62,87,87,96,88,87,81,...,84,51,36,87,90,90,94,27,26,29
3,125000,13,11,15,43,13,12,13,14,40,...,12,34,19,11,65,11,68,27,12,18
4,470000,81,84,61,89,83,95,83,79,83,...,80,54,41,87,89,88,91,34,27,22


### Part 2 Work

In [72]:
# clean and modify the data so that it is usable
df = df[df['wage_eur'] < 10000]
df = df[df['wage_eur'] > 0]
df.describe()

Unnamed: 0,wage_eur,attacking_crossing,attacking_finishing,attacking_heading_accuracy,attacking_short_passing,attacking_volleys,skill_dribbling,skill_curve,skill_fk_accuracy,skill_long_passing,...,power_long_shots,mentality_aggression,mentality_interceptions,mentality_positioning,mentality_vision,mentality_penalties,mentality_composure,defending_marking,defending_standing_tackle,defending_sliding_tackle
count,13918.0,13918.0,13918.0,13918.0,13918.0,13918.0,13918.0,13918.0,13918.0,13918.0,...,13918.0,13918.0,13918.0,13918.0,13918.0,13918.0,13918.0,13918.0,13918.0,13918.0
mean,2888.633424,46.992743,42.923265,49.921828,55.989366,39.985343,52.792643,44.337189,40.148944,50.017747,...,43.68494,53.250036,44.024788,47.302773,50.848541,46.216913,55.356157,44.776045,45.675815,43.915002
std,2160.136829,17.369479,18.649022,16.678526,13.967362,16.286857,18.232768,17.125694,16.14837,14.485749,...,18.192711,16.725327,19.850531,18.622796,12.91417,14.957505,10.755981,19.122304,20.683852,20.200786
min,1000.0,5.0,2.0,5.0,7.0,3.0,4.0,6.0,4.0,8.0,...,4.0,9.0,3.0,2.0,9.0,7.0,12.0,1.0,5.0,3.0
25%,1000.0,35.0,28.0,43.0,52.0,29.0,47.0,32.0,29.0,40.0,...,29.0,42.0,24.0,36.0,42.0,37.0,49.0,28.0,25.0,23.0
50%,2000.0,51.0,46.0,54.0,60.0,41.0,59.0,46.0,39.0,53.0,...,48.0,56.0,50.0,53.0,53.0,47.0,57.0,50.0,53.0,51.0
75%,4000.0,61.0,59.0,62.0,65.0,53.0,65.0,58.0,52.0,61.0,...,58.0,66.0,61.75,61.0,61.0,58.0,63.0,61.0,64.0,62.0
max,9000.0,85.0,83.0,91.0,84.0,85.0,87.0,85.0,87.0,83.0,...,85.0,94.0,83.0,84.0,83.0,87.0,87.0,83.0,83.0,85.0


In [69]:
#do a linear regression, train test split
yWage = np.array(df['wage_eur']).reshape(-1,1)
xWage = np.array(df.drop(columns='wage_eur'))

xTrainWage, xTestWage, yTrainWage, yTestWage = train_test_split(xWage,yWage,test_size=0.33)

wageModel = LinearRegression().fit(xTrainWage,yTrainWage)
rSq = wageModel.score(xTestWage, yTestWage)
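wagePredictions = wageModel.predict(xTestWage)
# RMSE via np.sqrt, since squared=False is deprecated in recent scikit-learn releases.
rmse = np.sqrt(mean_squared_error(yTestWage, wagePredictions))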
print('R Squared:',rSq)
print('RMSE:',rmse)

R Squared: 0.2520606033846725
RMSE: 1845.077071512537
