# Introduction

This short tutorial walks through a set of Pandas commands that will help you get the most insight out of your dataset.

In [116]:
import pandas as pd
from src.utils import load_data_from_google_drive

# Loading in Data

The first step in any ML problem is identifying what format your data is in, and then loading it into whatever framework you're using. For Kaggle competitions, a lot of data can be found in CSV files, so that's the example we're going to use.

We're going to be looking at a sports dataset that shows the results from NCAA basketball games from 1985 to 2016. This dataset is in a CSV file, and the function we're going to use to read in the file is called **pd.read_csv()**. This function returns a **dataframe** variable. The dataframe is the golden jewel data structure for Pandas. It is defined as "a two-dimensional size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns)".

Just think of it as a table for now. 
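As a minimal sketch of what **pd.read_csv()** does, the snippet below parses a few rows typed inline instead of the real file. The `csv_text` string and the `sample` name are made up for illustration; only the column names mirror this dataset.

```python
import io
import pandas as pd

# Hypothetical three-row excerpt, typed inline so the example runs without any file
csv_text = """Season,Daynum,Wteam,Wscore,Lteam,Lscore,Wloc,Numot
1985,20,1228,81,1328,64,N,0
1985,25,1106,77,1354,70,H,0
1985,25,1112,63,1223,56,H,0
"""

# read_csv accepts a path or any file-like object and returns a DataFrame
sample = pd.read_csv(io.StringIO(csv_text))

print(type(sample).__name__)  # DataFrame
print(sample.shape)           # (3, 8)
```

Passing a file path instead of the `StringIO` object works the same way.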

In [117]:
df = load_data_from_google_drive(url='https://drive.google.com/file/d/184JcLbSpArA_uq0DgAv2k892KChJVPHt/view?usp=share_link')

In [118]:
df

Unnamed: 0,Season,Daynum,Wteam,Wscore,Lteam,Lscore,Wloc,Numot
0,1985,20,1228,81,1328,64,N,0
1,1985,25,1106,77,1354,70,H,0
2,1985,25,1112,63,1223,56,H,0
3,1985,25,1165,70,1432,54,H,0
4,1985,25,1192,86,1447,74,H,0
...,...,...,...,...,...,...,...,...
145284,2016,132,1114,70,1419,50,N,0
145285,2016,132,1163,72,1272,58,N,0
145286,2016,132,1246,82,1401,77,N,1
145287,2016,132,1277,66,1345,62,N,0


# The Basics

Now that we have our dataframe in our variable df, let's look at what it contains. We can use the function **head()** to see the first couple rows of the dataframe (or the function **tail()** to see the last few rows).

In [119]:
df.head()

Unnamed: 0,Season,Daynum,Wteam,Wscore,Lteam,Lscore,Wloc,Numot
0,1985,20,1228,81,1328,64,N,0
1,1985,25,1106,77,1354,70,H,0
2,1985,25,1112,63,1223,56,H,0
3,1985,25,1165,70,1432,54,H,0
4,1985,25,1192,86,1447,74,H,0


In [120]:
df.tail()

Unnamed: 0,Season,Daynum,Wteam,Wscore,Lteam,Lscore,Wloc,Numot
145284,2016,132,1114,70,1419,50,N,0
145285,2016,132,1163,72,1272,58,N,0
145286,2016,132,1246,82,1401,77,N,1
145287,2016,132,1277,66,1345,62,N,0
145288,2016,132,1386,87,1433,74,N,0


We can see the dimensions of the dataframe using the **shape** attribute.

In [121]:
df.shape

(145289, 8)

We can also extract all the column names as a list by using the **columns** attribute, and the row labels by using the **index** attribute.

In [122]:
df.columns.tolist()

['Season', 'Daynum', 'Wteam', 'Wscore', 'Lteam', 'Lscore', 'Wloc', 'Numot']

In order to get a better idea of the type of data that we are dealing with, we can call the **describe()** function to see statistics like mean, min, etc about each column of the dataset. 

In [123]:
df.describe()

Unnamed: 0,Season,Daynum,Wteam,Wscore,Lteam,Lscore,Numot
count,145289.0,145289.0,145289.0,145289.0,145289.0,145289.0,145289.0
mean,2001.574834,75.223816,1286.720646,76.600321,1282.864064,64.497009,0.044387
std,9.233342,33.287418,104.570275,12.173033,104.829234,11.380625,0.247819
min,1985.0,0.0,1101.0,34.0,1101.0,20.0,0.0
25%,1994.0,47.0,1198.0,68.0,1191.0,57.0,0.0
50%,2002.0,78.0,1284.0,76.0,1280.0,64.0,0.0
75%,2010.0,103.0,1379.0,84.0,1375.0,72.0,0.0
max,2016.0,132.0,1464.0,186.0,1464.0,150.0,6.0


Okay, so now let's look at information that we want to extract from the dataframe. Let's say I wanted to know the max value of a certain column. The function **max()** will show you the maximum value of every column.

In [124]:
df.max()

Season    2016
Daynum     132
Wteam     1464
Wscore     186
Lteam     1464
Lscore     150
Wloc         N
Numot        6
dtype: object

Then, if you'd like to specifically get the max value for a particular column, you pass in the name of the column using the bracket indexing operator

In [125]:
df['Wscore'].max()

186

Similarly, to find the mean of the losing teams' scores, call **mean()** on the 'Lscore' column.

In [126]:
df['Lscore'].mean()

64.49700940883343

But what if that's not enough? Let's say we want to actually see the game (row) where this max score happened. We can call the **argmax()** function to identify the integer position of that row.

In [127]:
df['Wscore'].argmax()

24970
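In this dataset the row labels happen to equal the row positions, so a position and a label look the same. The sketch below uses a made-up Series with string labels to show the difference between **argmax()** (position) and **idxmax()** (label); the data and names are illustrative assumptions.

```python
import pandas as pd

# Toy scores with string row labels (made up for illustration)
scores = pd.Series([70, 186, 95], index=["game_a", "game_b", "game_c"])

position = scores.argmax()  # integer position of the largest value
label = scores.idxmax()     # row label of the largest value

print(position)  # 1
print(label)     # game_b
```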

One of the most useful functions that you can call on certain columns in a dataframe is the **value_counts()** function. It shows how many times each item appears in the column. This particular command shows the number of games in each season

In [128]:
df['Season'].value_counts()

Season
2016    5369
2014    5362
2015    5354
2013    5320
2010    5263
2012    5253
2009    5249
2011    5246
2008    5163
2007    5043
2006    4757
2005    4675
2003    4616
2004    4571
2002    4555
2000    4519
2001    4467
1999    4222
1998    4167
1997    4155
1992    4127
1991    4123
1996    4122
1995    4077
1994    4060
1990    4045
1989    4037
1993    3982
1988    3955
1987    3915
1986    3783
1985    3737
Name: count, dtype: int64
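A few variations on **value_counts()** are often handy. The sketch below runs on a small made-up Series rather than the real 'Season' column, and shows the default ordering, ordering by label, and relative frequencies.

```python
import pandas as pd

# Hypothetical season column with six games
seasons = pd.Series([1985, 1985, 1986, 1987, 1987, 1987])

counts = seasons.value_counts()               # most frequent value first
by_season = seasons.value_counts().sort_index()  # ordered by season instead
share = seasons.value_counts(normalize=True)  # fractions that sum to 1

print(counts.index.tolist())  # [1987, 1985, 1986]
print(by_season.tolist())     # [2, 1, 3]
print(share.loc[1987])        # 0.5
```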

**Q**: How many unique seasons are there in the dataset? Use the nunique() function.

In [129]:
# Write your code here

# Apply nunique() to the 'Season' column to count the distinct seasons
df['Season'].nunique()

32

- There are 32 unique seasons.

**Q**: Find the team with the most wins. Use the value_counts() function on the Wteam column.

In [130]:
# Write your code here

# Select the 'Wteam' column and call value_counts() to count the wins per team
df['Wteam'].value_counts()

Wteam
1181    819
1242    804
1246    765
1314    761
1112    746
       ... 
1101     18
1446     14
1118      6
1289      6
1327      3
Name: count, Length: 364, dtype: int64

- value_counts() sorts its result from the team with the most wins to the team with the fewest.

- The team with the most wins is therefore team 1181, with 819 wins.

# Accessing Values

Then, in order to get attributes about the game, we need to use the **iloc[]** indexer. iloc is definitely one of the more important tools in Pandas. The main idea is that you want to use it whenever you have the integer position of a certain row that you want to access. As per the Pandas documentation, iloc is "integer-location based indexing for selection by position."

In [131]:
df.iloc[[df['Wscore'].argmax()]]

Unnamed: 0,Season,Daynum,Wteam,Wscore,Lteam,Lscore,Wloc,Numot
24970,1991,68,1258,186,1109,140,H,0


Let's take this a step further. Let's say you want to know the game with the highest scoring winning team (this is what we just calculated), but you then want to know how many points the losing team scored. 

In [132]:
df.iloc[[df['Wscore'].argmax()]]['Lscore']

24970    140
Name: Lscore, dtype: int64

When you see data displayed in the above format, you're dealing with a Pandas **Series** object, not a dataframe object.

In [133]:
type(df.iloc[[df['Wscore'].argmax()]]['Lscore'])

pandas.core.series.Series

In [134]:
type(df.iloc[[df['Wscore'].argmax()]])

pandas.core.frame.DataFrame

The following is a summary of the 3 data structures in Pandas (I haven't ever really used Panels, and the Panel type has since been removed from Pandas).

![](DataStructures.png)

When you want to access values in a Series, you can treat the Series like a Python dictionary, so you'd access the value according to its key (which is normally an integer index).

In [135]:
df.iloc[[df['Wscore'].argmax()]]['Lscore'][24970]

140

The other really important indexer in Pandas is **loc**. In contrast to iloc, which indexes by integer position, loc is a "purely label-location based indexer for selection by label". Since all the games are labeled from 0 to 145288, iloc and loc are going to be pretty interchangeable in this type of dataset.

In [136]:
df.iloc[:3]

Unnamed: 0,Season,Daynum,Wteam,Wscore,Lteam,Lscore,Wloc,Numot
0,1985,20,1228,81,1328,64,N,0
1,1985,25,1106,77,1354,70,H,0
2,1985,25,1112,63,1223,56,H,0


In [137]:
df.loc[:3]

Unnamed: 0,Season,Daynum,Wteam,Wscore,Lteam,Lscore,Wloc,Numot
0,1985,20,1228,81,1328,64,N,0
1,1985,25,1106,77,1354,70,H,0
2,1985,25,1112,63,1223,56,H,0
3,1985,25,1165,70,1432,54,H,0


Notice the slight difference in that iloc is exclusive of the second number, while loc is inclusive. 

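Below is an example of how you can use loc to achieve the same task as we did previously with iloc. Note that argmax() returns a position, which only doubles as a label here because the rows are labeled 0 to 145288; **idxmax()** returns the label directly.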

In [138]:
df.loc[df['Wscore'].argmax(), 'Lscore']

140

A faster version uses the **at[]** accessor. at[] is really useful whenever you know the row label and the column label of the particular value that you want to get.

In [139]:
df.at[df['Wscore'].argmax(), 'Lscore']

140

If you'd like to see more discussion on how loc and iloc are different, check out this great Stack Overflow post: http://stackoverflow.com/questions/31593201/pandas-iloc-vs-ix-vs-loc-explanation. Just remember that **iloc looks at position** and **loc looks at labels**. Loc becomes very important when your row labels aren't integers. 
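To make the position-versus-label distinction concrete, here is a small sketch on a made-up dataframe whose row labels are strings. The `games` frame and its labels are assumptions for illustration only.

```python
import pandas as pd

# Toy frame with string row labels (hypothetical data)
games = pd.DataFrame(
    {"Wscore": [81, 77, 63], "Lscore": [64, 70, 56]},
    index=["g1", "g2", "g3"],
)

by_position = games.iloc[0:2]      # rows at positions 0 and 1 (end excluded)
by_label = games.loc["g1":"g2"]    # rows labeled g1 through g2 (end included)
single = games.at["g3", "Lscore"]  # one scalar by row label and column label

print(by_position.index.tolist())  # ['g1', 'g2']
print(by_label.index.tolist())     # ['g1', 'g2']
print(single)                      # 56
```

With string labels, `games.loc[0:2]` would no longer be a valid way to ask for the first rows, which is why iloc is the safer choice whenever you mean positions.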

# Sorting

Let's say that we want to sort the dataframe in increasing order for the scores of the losing team

In [140]:
df.sort_values('Lscore').head()

Unnamed: 0,Season,Daynum,Wteam,Wscore,Lteam,Lscore,Wloc,Numot
100027,2008,66,1203,49,1387,20,H,0
49310,1997,66,1157,61,1204,21,H,0
89021,2006,44,1284,41,1343,21,A,0
85042,2005,66,1131,73,1216,22,H,0
103660,2009,26,1326,59,1359,22,H,0
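**sort_values()** also accepts a sort direction and a list of columns. The following sketch uses a small made-up frame (not the real dataset) to show a descending sort and a two-key sort.

```python
import pandas as pd

# Hypothetical four-game frame
games = pd.DataFrame({
    "Season": [1986, 1985, 1985, 1986],
    "Wscore": [90, 70, 88, 75],
})

# Highest winning score first
desc = games.sort_values("Wscore", ascending=False)

# Season ascending, then winning score descending within each season
multi = games.sort_values(["Season", "Wscore"], ascending=[True, False])

print(desc["Wscore"].tolist())  # [90, 88, 75, 70]
print(multi.index.tolist())     # [2, 1, 0, 3]
```

Note that sorting keeps the original row labels, which is why the index in the output above is no longer in increasing order.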


**Q**: Make three dataframes that are sorted by season, winning team, and winning score respectively. Then, using iloc, select the rows from index 100 to 200 and the columns for season, winning team, and winning score, respectively.

In [141]:
# Write your code here

# Reuse the sort_values call from above with 'Season', 'Wteam' and 'Wscore'
# df.sort_values('Season')

# df.sort_values('Wteam')

df.sort_values('Wscore') # Only the 'Wscore' result is shown; uncomment the two lines above to see the other dataframes

Unnamed: 0,Season,Daynum,Wteam,Wscore,Lteam,Lscore,Wloc,Numot
128582,2013,116,1264,34,1193,31,H,0
69336,2001,126,1206,35,1423,33,N,0
5364,1986,77,1229,35,1166,34,A,0
20246,1990,47,1374,36,1426,31,H,0
118590,2011,130,1336,36,1458,33,N,0
...,...,...,...,...,...,...,...,...
20022,1990,40,1116,166,1109,101,H,0
24341,1991,47,1328,172,1258,112,H,0
19653,1990,30,1328,173,1109,101,H,0
17867,1989,92,1258,181,1109,150,H,0


In [142]:
# Now use iloc to select specific rows from the sorted dataframes
#df.sort_values('Wscore').iloc[100:201]

#df.sort_values('Wteam').iloc[100:201]

df.sort_values('Season').iloc[100:201] # The slice ends at 201 because iloc excludes the end position,
                                       # so this returns the rows at positions 100 through 200

Unnamed: 0,Season,Daynum,Wteam,Wscore,Lteam,Lscore,Wloc,Numot
2525,1985,100,1406,68,1137,65,H,0
2526,1985,100,1410,83,1320,59,H,0
2527,1985,100,1423,58,1187,56,H,0
2528,1985,100,1429,86,1253,59,H,0
2529,1985,100,1440,50,1456,48,H,0
...,...,...,...,...,...,...,...,...
2447,1985,98,1212,63,1398,53,H,0
2448,1985,98,1216,64,1134,63,H,0
2449,1985,98,1233,77,1264,69,H,0
2450,1985,98,1249,80,1427,70,H,0


**Q**: From these three subsets you obtained above, find the season and winning team for the game with the highest winning score.

In [143]:
# Write your code here

# Use tail(1) on the dataframe sorted by 'Wscore' to get the game with the highest winning score
df.sort_values("Wscore").tail(1) 

Unnamed: 0,Season,Daynum,Wteam,Wscore,Lteam,Lscore,Wloc,Numot
24970,1991,68,1258,186,1109,140,H,0


- The game was played in the 1991 season, the winning team was 1258, and the highest winning score was 186.

# Filtering Rows Conditionally

Now, let's say we want to find all of the rows that satisfy a particular condition. For example, I want to find all of the games where the winning team scored more than 150 points. The idea behind this command is that you access the column 'Wscore' of the dataframe df (df['Wscore']), find which entries are above 150 (df['Wscore'] > 150), and then return only those specific rows in a dataframe format (df[df['Wscore'] > 150]).

In [144]:
df[df['Wscore'] > 150]

Unnamed: 0,Season,Daynum,Wteam,Wscore,Lteam,Lscore,Wloc,Numot
5269,1986,75,1258,151,1109,107,H,0
12046,1988,40,1328,152,1147,84,H,0
12355,1988,52,1328,151,1173,99,N,0
16040,1989,40,1328,152,1331,122,H,0
16853,1989,68,1258,162,1109,144,A,0
17867,1989,92,1258,181,1109,150,H,0
19653,1990,30,1328,173,1109,101,H,0
19971,1990,38,1258,152,1109,137,A,0
20022,1990,40,1116,166,1109,101,H,0
22145,1990,97,1258,157,1362,115,H,0


This also works if you have multiple conditions. Let's say we want to find out when the winning team scores more than 150 points and when the losing team scores below 100. 

In [145]:
df[(df['Wscore'] > 150) & (df['Lscore'] < 100)]

Unnamed: 0,Season,Daynum,Wteam,Wscore,Lteam,Lscore,Wloc,Numot
12046,1988,40,1328,152,1147,84,H,0
12355,1988,52,1328,151,1173,99,N,0
25656,1991,84,1106,151,1212,97,H,0
28687,1992,54,1261,159,1319,86,H,0
35023,1993,112,1380,155,1341,91,A,0
52600,1998,33,1395,153,1410,87,H,0
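Besides `&` (and), boolean masks can be combined with `|` (or) and negated with `~` (not). The sketch below uses a made-up four-row frame; each condition needs its own parentheses because these operators bind more tightly than comparisons.

```python
import pandas as pd

# Hypothetical four-game frame
games = pd.DataFrame({
    "Wscore": [151, 152, 90, 160],
    "Lscore": [107, 84, 80, 99],
    "Wloc": ["H", "H", "A", "N"],
})

high = games["Wscore"] > 150    # boolean Series, one entry per row
close = games["Lscore"] >= 100

either = games[high | close]                      # OR: at least one condition holds
not_home = games[high & ~(games["Wloc"] == "H")]  # AND combined with NOT

print(either.index.tolist())    # [0, 1, 3]
print(not_home.index.tolist())  # [3]
```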


**Q**: Create a new column in the DataFrame called 'ScoreDifference' which is the absolute difference between the winning score and the losing score. Filter the DataFrame to only include games where the 'ScoreDifference' is greater than the average 'ScoreDifference' for all games.

In [146]:
# Write your code here

# Make the new column by taking the absolute difference between Wscore and Lscore
df['ScoreDifference'] = abs(df['Wscore'] - df['Lscore']) 

# SD keeps only the games where the score difference is greater than the mean score difference
SD = df[(df['ScoreDifference'] > df['ScoreDifference'].mean())] 
SD

Unnamed: 0,Season,Daynum,Wteam,Wscore,Lteam,Lscore,Wloc,Numot,ScoreDifference
0,1985,20,1228,81,1328,64,N,0,17
3,1985,25,1165,70,1432,54,H,0,16
6,1985,25,1228,64,1226,44,N,0,20
8,1985,25,1260,98,1133,80,H,0,18
10,1985,25,1307,103,1288,71,H,0,32
...,...,...,...,...,...,...,...,...,...
145280,2016,131,1401,71,1261,38,N,0,33
145282,2016,131,1433,76,1172,54,N,0,22
145284,2016,132,1114,70,1419,50,N,0,20
145285,2016,132,1163,72,1272,58,N,0,14


- The new dataframe, SD, shows that 57227 games have a score difference higher than the mean score difference for all games.

**Q**: From this filtered DataFrame, find the season and teams involved in the game with the highest 'ScoreDifference'.

In [147]:
# Write your code here


# Reuse sort_values from the previous question, which sorts from the lowest score difference to the highest
# Then use tail(1) to get the game with the highest score difference, i.e. the last game in the sorted dataframe
SD.sort_values('ScoreDifference').tail(1)

Unnamed: 0,Season,Daynum,Wteam,Wscore,Lteam,Lscore,Wloc,Numot,ScoreDifference
44653,1996,48,1409,141,1341,50,H,0,91


- This is the last row in the sorted SD dataFrame, and it shows that team 1409 and team 1341 played a game in season 1996, and the score difference was 91

# Grouping

Another important function in Pandas is **groupby()**. This is a function that allows you to group entries by certain attributes (e.g. grouping entries by Wteam number) and then perform operations on them. The following command groups all the entries (games) with the same Wteam number and finds the mean winning score for each group.

In [148]:
df.groupby('Wteam')['Wscore'].mean().head()

Wteam
1101    78.111111
1102    69.893204
1103    75.839768
1104    75.825944
1105    74.960894
Name: Wscore, dtype: float64

This next command groups all the games with the same Wteam number and finds how many times that specific team won at home, on the road, or at a neutral site.

In [149]:
df.groupby('Wteam')['Wloc'].value_counts().head(9)

Wteam  Wloc
1101   H        12
       A         3
       N         3
1102   H       204
       A        73
       N        32
1103   H       324
       A       153
       N        41
Name: count, dtype: int64
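Two common follow-ups to these groupby calls are computing several statistics at once with **agg()** and turning the location counts into one column per location with **unstack()**. This is a sketch on made-up data; the `games` frame is an assumption, not the real dataset.

```python
import pandas as pd

# Hypothetical five-game frame for two teams
games = pd.DataFrame({
    "Wteam": [1101, 1101, 1102, 1102, 1102],
    "Wscore": [80, 70, 60, 90, 75],
    "Wloc": ["H", "A", "H", "H", "N"],
})

# Several summary statistics per team in one call
summary = games.groupby("Wteam")["Wscore"].agg(["mean", "max", "count"])

# One column per location; team/location pairs that never occur become 0
loc_counts = games.groupby("Wteam")["Wloc"].value_counts().unstack(fill_value=0)

print(summary)
print(loc_counts)
```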

Each dataframe has a **values** attribute, which is useful because it returns the contents of your dataframe as a NumPy array.

In [150]:
df.values

array([[1985, 20, 1228, ..., 'N', 0, 17],
       [1985, 25, 1106, ..., 'H', 0, 7],
       [1985, 25, 1112, ..., 'H', 0, 7],
       ...,
       [2016, 132, 1246, ..., 'N', 1, 5],
       [2016, 132, 1277, ..., 'N', 0, 4],
       [2016, 132, 1386, ..., 'N', 0, 13]], dtype=object)

Now, you can simply just access elements like you would in an array. 

In [151]:
df.values[0][0]

1985
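The `dtype=object` in the array above appears because the dataframe mixes numbers and strings. As a small sketch on made-up data, the **to_numpy()** method gives the same kind of array, and selecting only numeric columns first keeps a numeric dtype.

```python
import pandas as pd

# Hypothetical two-game frame with one string column
games = pd.DataFrame({"Wscore": [81, 77], "Lscore": [64, 70], "Wloc": ["N", "H"]})

# Numeric-only selection keeps an integer dtype
numeric = games[["Wscore", "Lscore"]].to_numpy()

# Mixing numbers and strings forces the generic object dtype
mixed = games.to_numpy()

print(numeric.dtype.kind)  # i (integer)
print(mixed.dtype)         # object
print(numeric[0, 1])       # 64
```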

**Q**: Group the DataFrame by season and find the average winning score for each season.

In [152]:
# Write your code here

# Reuse the groupby call from above with 'Season' and 'Wscore'
df.groupby('Season')['Wscore'].mean().head()

Season
1985    74.723040
1986    74.813640
1987    77.993870
1988    79.773704
1989    81.728511
Name: Wscore, dtype: float64

- I use the head() function to show the average winning score for the first 5 seasons.

- Among these first 5 seasons, 1989 has the highest average winning score, at 81.7.

**Q**: Group the DataFrame by winning team and find the maximum winning score for each team across all seasons.

In [153]:
# Write your code here

# Instead of mean(), as in the previous cell, use max() to find the maximum winning score
df.groupby('Wteam')['Wscore'].max()

Wteam
1101     95
1102    111
1103    109
1104    114
1105    114
       ... 
1460    136
1461    112
1462    125
1463    105
1464    115
Name: Wscore, Length: 364, dtype: int64

- Now we have a Series with the maximum winning score for each team across all seasons.

- For example, team 1101 has a maximum winning score of 95, while team 1460 has a maximum of 136.

**Q**: Group the DataFrame by both season and winning team. Find the team with the highest average winning score for each season.

In [154]:
# Write your code here

# Use groupby again with both 'Season' and 'Wteam' to find the mean 'Wscore'
# Name the result 'AvgWinScore' because it is the average winning score for each team
A = df.groupby(['Season','Wteam'])['Wscore'].mean().reset_index(name='AvgWinScore')
A # The code continues below

Unnamed: 0,Season,Wteam,AvgWinScore
0,1985,1102,71.000000
1,1985,1103,70.222222
2,1985,1104,72.095238
3,1985,1106,75.100000
4,1985,1108,85.842105
...,...,...,...
10167,2016,1460,73.450000
10168,2016,1461,73.416667
10169,2016,1462,83.222222
10170,2016,1463,78.238095


In [155]:
# A lists every team's average winning score in every season, but we only want the team with the highest average in each season
# So we group by 'Season' again and use idxmax() to find, for each season, the row label of the highest
# average winning score
# A.loc[...] then selects those rows, giving one team per season
max_A_winning_score = A.loc[A.groupby('Season')['AvgWinScore'].idxmax()]
max_A_winning_score.head()

Unnamed: 0,Season,Wteam,AvgWinScore
174,1985,1328,92.8
287,1986,1109,91.2
783,1987,1380,95.875
977,1988,1258,111.75
1268,1989,1258,117.315789


- The first five seasons are shown; team 1328 had the highest average winning score in the 1985 season, at 92.8.

**Q**: Create a new DataFrame that counts the number of wins for each team in each season. This will involve grouping by both season and winning team, and then using the count() function.

In [156]:
# Write your code here

# Group by 'Season' and 'Wteam' and count the 'Wscore' entries: each row is one win for that team
# This is done for all the teams in each season
# reset_index(name='Wins') stores the counts in a new column called 'Wins'
count_wins = df.groupby(['Season','Wteam'])['Wscore'].count().reset_index(name='Wins')
count_wins

Unnamed: 0,Season,Wteam,Wins
0,1985,1102,5
1,1985,1103,9
2,1985,1104,21
3,1985,1106,10
4,1985,1108,19
...,...,...,...
10167,2016,1460,20
10168,2016,1461,12
10169,2016,1462,27
10170,2016,1463,21


- The dataframe above now shows the number of wins for each team in each season.

**Q**: For each season, find the team with the most wins. This will involve creating a DataFrame similar to the one in task 5, and then using the idxmax() function for each season.

In [157]:
# Write your code here

# Reuse the dataframe from the previous task: group by 'Season' and use idxmax() on 'Wins'
# to find the team with the most wins in every season
most_wins = count_wins.loc[count_wins.groupby('Season')['Wins'].idxmax()]
most_wins

Unnamed: 0,Season,Wteam,Wins
217,1985,1385,27
342,1986,1181,32
818,1987,1424,33
861,1988,1112,31
1323,1989,1328,28
1551,1990,1247,29
1739,1991,1116,30
2088,1992,1181,28
2423,1993,1231,28
2665,1994,1163,27


- The resulting dataframe lists the team with the most wins in each of the 32 seasons.

**Q**: Group the DataFrame by losing team and find the average losing score for each team across all seasons. Compare this with the average winning score for each team from task 3. Are there teams that have a higher average losing score than winning score?

In [158]:
# Write your code here

# Group by 'Season' and 'Lteam' and take the mean of 'Lscore' to get each team's average losing score per season
# reset_index(name='AvgLosScore') stores the result in a new column called 'AvgLosScore'
Al = df.groupby(["Season","Lteam"])["Lscore"].mean().reset_index(name='AvgLosScore')
Al

Unnamed: 0,Season,Lteam,AvgLosScore
0,1985,1102,61.000000
1,1985,1103,55.142857
2,1985,1104,60.111111
3,1985,1106,69.142857
4,1985,1108,74.000000
...,...,...,...
10178,2016,1460,61.230769
10179,2016,1461,68.277778
10180,2016,1462,71.200000
10181,2016,1463,61.333333


In [159]:
A # This is the dataframe from task 3 to compare against

Unnamed: 0,Season,Wteam,AvgWinScore
0,1985,1102,71.000000
1,1985,1103,70.222222
2,1985,1104,72.095238
3,1985,1106,75.100000
4,1985,1108,85.842105
...,...,...,...
10167,2016,1460,73.450000
10168,2016,1461,73.416667
10169,2016,1462,83.222222
10170,2016,1463,78.238095


In [None]:
# Find the teams that have a higher average losing score than average winning score
# First, merge the two dataframes, A and Al, on the team id
# Note: the merge key is the team only, so each (season, team) row of A is paired with
# every season of the same team in Al
comparison_df = pd.merge(A, Al, left_on="Wteam", right_on="Lteam")
comparison_df['Season'] = comparison_df['Season_x']
comparison_df = comparison_df.drop(columns=['Season_x', 'Season_y'])
# Both A and Al have a 'Season' column, so the merged dataframe contains Season_x (from A) and Season_y (from Al)
# Keep the season from A as 'Season' and drop both suffixed columns

# Keep only the rows where the average losing score is higher than the average winning score
teams_higher_losing_score = comparison_df[comparison_df['AvgLosScore'] > comparison_df['AvgWinScore']]

# The new dataframe only shows the rows with a higher average losing score
teams_higher_losing_score

Unnamed: 0,Wteam,AvgWinScore,Lteam,AvgLosScore,Season
35,1103,70.222222,1103,71.714286,1985
36,1103,70.222222,1103,75.285714,1985
45,1103,70.222222,1103,73.400000,1985
50,1103,70.222222,1103,70.428571,1985
71,1104,72.095238,1104,73.000000,1985
...,...,...,...,...,...
309130,1460,73.450000,1460,74.333333,2016
309160,1461,73.416667,1461,78.636364,2016
309168,1461,73.416667,1461,77.444444,2016
309169,1461,73.416667,1461,75.666667,2016


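- This is the final dataframe. Because the merge matched rows on the team only, each row pairs a team's average winning score in the season shown with its average losing score from some season (the losing season was dropped along with Season_y). The first four rows therefore show four seasons in which team 1103's average losing score was higher than its 1985 average winning score of 70.2, not four games in 1985.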
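A like-for-like comparison within each season needs the season in the merge key as well as the team. The following is a minimal sketch on made-up averages, not the real data; the frames `avg_win` and `avg_loss` and the shared column name `Team` are assumptions introduced for illustration.

```python
import pandas as pd

# Hypothetical per-season averages for two teams
avg_win = pd.DataFrame({
    "Season": [1985, 1986, 1985, 1986],
    "Wteam": [1103, 1103, 1104, 1104],
    "AvgWinScore": [70.0, 78.0, 72.0, 74.0],
})
avg_loss = pd.DataFrame({
    "Season": [1985, 1986, 1985, 1986],
    "Lteam": [1103, 1103, 1104, 1104],
    "AvgLosScore": [71.5, 65.0, 60.0, 75.0],
})

# Give the team column the same name on both sides, then match on season AND team
merged = pd.merge(
    avg_win.rename(columns={"Wteam": "Team"}),
    avg_loss.rename(columns={"Lteam": "Team"}),
    on=["Season", "Team"],
)

# One row per (season, team) pair, so the filter compares the same season on both sides
higher_loss = merged[merged["AvgLosScore"] > merged["AvgWinScore"]]

print(len(merged))                                      # 4
print(higher_loss[["Season", "Team"]].values.tolist())  # [[1985, 1103], [1986, 1104]]
```

For the across-all-seasons version of the question, the same idea applies with the team as the only key, provided each input frame has already been reduced to one row per team.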

# Dataframe Iteration

In order to iterate through dataframes, we can use the **iterrows()** function. Below is an example of what the first two rows look like. Each row in iterrows is a Series object

In [161]:
for index, row in df.iterrows():
    print(row)
    if index == 1:
        break

Season             1985
Daynum               20
Wteam              1228
Wscore               81
Lteam              1328
Lscore               64
Wloc                  N
Numot                 0
ScoreDifference      17
Name: 0, dtype: object
Season             1985
Daynum               25
Wteam              1106
Wscore               77
Lteam              1354
Lscore               70
Wloc                  H
Numot                 0
ScoreDifference       7
Name: 1, dtype: object
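Note the `dtype: object` above: iterrows() packs each row into a single Series, so mixed column types are coerced to a common type. A commonly used alternative is **itertuples()**, sketched below on made-up data; it yields one namedtuple per row, with the row label available as `Index`.

```python
import pandas as pd

# Hypothetical two-game frame
games = pd.DataFrame({"Wteam": [1228, 1106], "Wscore": [81, 77], "Wloc": ["N", "H"]})

# Each row arrives as a namedtuple; columns are read as attributes
rows_seen = []
for row in games.itertuples(index=True):
    rows_seen.append((row.Index, row.Wteam, row.Wscore, row.Wloc))

print(rows_seen)
```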


**Q**: Create a new column 'HighScoringGame' that is 'Yes' if the winning score is greater than 100 and 'No' otherwise. This will require iterating over the rows of the DataFrame and checking the value of the winning score for each row.

In [162]:
# Write your code here

# First, create an empty list that will later become a new column in the dataframe
high_scoring_game = []

# Reuse the for loop from above, with an if statement that splits the games into two groups, 'Yes' and 'No'
# If the winning score is over 100 the game is a 'Yes', otherwise a 'No'
for index, row in df.iterrows():
    if row['Wscore'] > 100:
        high_scoring_game.append('Yes')
    else:
        high_scoring_game.append('No')

# Add the list of 'Yes' and 'No' values to the dataframe as a new column named 'HighScoringGame'
df['HighScoringGame'] = high_scoring_game
df



Unnamed: 0,Season,Daynum,Wteam,Wscore,Lteam,Lscore,Wloc,Numot,ScoreDifference,HighScoringGame
0,1985,20,1228,81,1328,64,N,0,17,No
1,1985,25,1106,77,1354,70,H,0,7,No
2,1985,25,1112,63,1223,56,H,0,7,No
3,1985,25,1165,70,1432,54,H,0,16,No
4,1985,25,1192,86,1447,74,H,0,12,No
...,...,...,...,...,...,...,...,...,...,...
145284,2016,132,1114,70,1419,50,N,0,20,No
145285,2016,132,1163,72,1272,58,N,0,14,No
145286,2016,132,1246,82,1401,77,N,1,5,No
145287,2016,132,1277,66,1345,62,N,0,4,No


- The new last column shows whether each game was a high-scoring game.

- None of the first five or last five games is a high-scoring game.

**Q**: Calculate the total number of games played by each team, whether they won or lost. This will require iterating over the rows of the DataFrame and updating a dictionary that keeps track of the number of games for each team.

In [163]:
# Write your code here

# I create an empty dictionary to store the number of games for each team in the dataFrame
games_played = {}

# I use a for loop to iterate over the rows of the DataFrame, extracting only the data from the 'Wteam' and 'Lteam' columns for each row.
for index, row in df.iterrows():
    win_team = row['Wteam']
    los_team = row['Lteam']
    
    # I use this if statement to update the dictionary 'games_played' 
    # Check if the dictionary includes the specific team we've extracted.
    # If the team is already in the dictionary, increment their game count by 1. 
    # If the team is not in the dictionary, add them with an initial count of 1.
    if win_team in games_played:
        games_played[win_team] += 1
    else:
        games_played[win_team] = 1
    
    # This if statement does the same for the losing team.
    if los_team in games_played:
        games_played[los_team] += 1
    else:
        games_played[los_team] = 1

# We convert the dictionary into a DataFrame
games_played_df = pd.DataFrame(list(games_played.items()), columns=['Team', 'Total_Games'])
games_played_df

Unnamed: 0,Team,Total_Games
0,1228,992
1,1328,968
2,1106,855
3,1354,906
4,1112,981
...,...,...
359,1297,113
360,1213,84
361,1262,83
362,1101,76


- This new dataframe shows each team and the total number of games it has played across all seasons. The rows appear in the order in which the teams first occur in the data, not sorted by number of games.

**Q**: For each season, find the game with the highest score difference (winning score - losing score). This will require iterating over the rows of the DataFrame, keeping track of the highest score difference for each season, and updating it if a game with a higher score difference is found.

In [164]:
# Write your code here

# I create an empty dictionary that will later contain the highest score difference for each season
highest_score_diff_games = {}

# I iterate over the rows of the df DataFrame and calculate the score difference for each game
for index, row in df.iterrows():
    season = row['Season']
    score_diff = row['Wscore'] - row['Lscore']
    
    # I check if the season is already in the dictionary.
    # If it is not, I insert it.
    # If it is, I compare the current score difference to the one stored in the dictionary
    # and update the dictionary if the current score difference is higher.
    if season not in highest_score_diff_games or score_diff > highest_score_diff_games[season]['Score_Difference']:
        highest_score_diff_games[season] = {
            'Game_Index': index,
            'Wteam': row['Wteam'],  # Teams involved in the game
            'Lteam': row['Lteam'],
            'Wscore': row['Wscore'], # Winning and losing scores for the game 
            'Lscore': row['Lscore'],
            'Score_Difference': score_diff
        }

# I convert the dictionary into a pandas DataFrame
highest_score_diff_df = pd.DataFrame.from_dict(highest_score_diff_games, orient='index')
highest_score_diff_df


Unnamed: 0,Game_Index,Wteam,Lteam,Wscore,Lscore,Score_Difference
1985,236,1361,1288,128,68,60
1986,4731,1314,1264,129,45,84
1987,8240,1155,1118,112,39,73
1988,12046,1328,1147,152,84,68
1989,16677,1242,1135,115,45,70
1990,19502,1181,1217,130,54,76
1991,25161,1163,1148,115,47,68
1992,27997,1116,1126,128,46,82
1993,33858,1328,1197,146,65,81
1994,36404,1228,1152,121,52,69


- The dataframe shows which game had the highest score difference in each season.

Remember, iterating over a DataFrame should generally be avoided if a vectorized operation can be used instead, as vectorized operations are usually much faster. However, these tasks are designed to give practice with DataFrame iteration for cases where it might be necessary.

Vectorized Operation Example: Create a new column 'HighScoringGame' in the DataFrame using a vectorized operation. This column should contain 'Yes' if the winning score is greater than 100 and 'No' otherwise. Use the np.where function from the numpy library for this task.

In [165]:
import numpy as np
df['HighScoringGame'] = np.where(df['Wscore'] > 100, 'Yes', 'No')

**Q**: Vectorized Operation: Calculate the total number of games played by each team, whether they won or lost. Instead of iterating over the DataFrame, use the value_counts() function on the winning team and losing team columns separately, and then add the two Series together.

In [166]:
# Write your code here

# Use value_counts() on the winning and losing team columns
# First, count the number of wins for each team
wins_count = df['Wteam'].value_counts()

# Then count the number of losses for each team
losses_count = df['Lteam'].value_counts()

# add() sums the two Series by team; fill_value=0 covers teams that appear in only one of them
total_games = wins_count.add(losses_count, fill_value=0).reset_index()

# Name the two columns 'Team' and 'Total_Games'
total_games.columns = ['Team', 'Total_Games']
total_games


Unnamed: 0,Team,Total_Games
0,1101,76
1,1102,840
2,1103,910
3,1104,975
4,1105,447
...,...,...
359,1460,827
360,1461,914
361,1462,954
362,1463,838


- Now we can find a team, and see how many games they have played across the 32 seasons

**Q**: For each season, find the game with the highest score difference (winning score - losing score). Instead of iterating over the DataFrame, create a new column 'ScoreDifference' using vectorized subtraction, then use the groupby() function and idxmax() function to find the game with the highest score difference for each season.

In [176]:
# Write your code here

# Create a column with the score difference between the winning team and the losing team for each game
df['ScoreDifference'] = df['Wscore'] - df['Lscore']

# Group by season and use idxmax() to get the row label of the game
# with the highest score difference in each season
high_score_difference = df.groupby('Season')['ScoreDifference'].idxmax()

# Select only those games from the dataframe
highest_score_diff_games = df.loc[high_score_difference]
highest_score_diff_games


Unnamed: 0,Season,Daynum,Wteam,Wscore,Lteam,Lscore,Wloc,Numot,ScoreDifference,HighScoringGame
236,1985,33,1361,128,1288,68,H,0,60,Yes
4731,1986,60,1314,129,1264,45,N,0,84,Yes
8240,1987,51,1155,112,1118,39,H,0,73,Yes
12046,1988,40,1328,152,1147,84,H,0,68,Yes
16677,1989,64,1242,115,1135,45,H,0,70,Yes
19502,1990,26,1181,130,1217,54,H,0,76,Yes
25161,1991,73,1163,115,1148,47,H,0,68,Yes
27997,1992,30,1116,128,1126,46,H,0,82,Yes
33858,1993,86,1328,146,1197,65,H,0,81,Yes
36404,1994,47,1228,121,1152,52,H,0,69,Yes


- Looking at the last column, we see that every one of these games is a high-scoring game except the games from 2012 and 2013.

# Extracting Rows and Columns

The bracket indexing operator is one way to extract certain columns from a dataframe.

In [168]:
df[['Wscore', 'Lscore']].head()

Unnamed: 0,Wscore,Lscore
0,81,64
1,77,70
2,63,56
3,70,54
4,86,74


Notice that you can achieve the same result by using loc. Strictly speaking, loc is an indexer that you use with square brackets rather than a function you call, and it is veryyyy versatile: it can help you in a lot of accessing and extracting tasks.

In [169]:
df.loc[:, ['Wscore', 'Lscore']].head()

Unnamed: 0,Wscore,Lscore
0,81,64
1,77,70
2,63,56
3,70,54
4,86,74
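
To give a feel for that versatility, here's a minimal sketch of loc selecting rows by a condition and columns by name in one go. The small games table is made up for illustration and just reuses our dataset's column names.

```python
import pandas as pd

# Made-up table that reuses the tutorial's column names
games = pd.DataFrame({
    'Season': [1985, 1985, 1986, 1986],
    'Wscore': [81, 77, 102, 63],
    'Lscore': [64, 70, 95, 56],
    'Wloc': ['N', 'H', 'H', 'N'],
})

# Rows where the winner played at home, restricted to the two score columns
home_wins = games.loc[games['Wloc'] == 'H', ['Wscore', 'Lscore']]

# A single cell, addressed by row label and column label
first_wscore = games.loc[0, 'Wscore']

print(home_wins)
print(first_wscore)
```

The part before the comma picks rows (here a boolean mask) and the part after it picks columns.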


Note the difference in the return types between single brackets and double brackets.

In [170]:
type(df['Wscore'])

pandas.core.series.Series

In [171]:
type(df[['Wscore']])

pandas.core.frame.DataFrame

You've seen before that you can access columns through df['col name']. You can access rows by using slicing operations. 

In [172]:
df[0:3]

Unnamed: 0,Season,Daynum,Wteam,Wscore,Lteam,Lscore,Wloc,Numot,ScoreDifference,HighScoringGame
0,1985,20,1228,81,1328,64,N,0,17,No
1,1985,25,1106,77,1354,70,H,0,7,No
2,1985,25,1112,63,1223,56,H,0,7,No


Here's an equivalent using iloc, which selects rows and columns by integer position:

In [173]:
df.iloc[0:3,:]

Unnamed: 0,Season,Daynum,Wteam,Wscore,Lteam,Lscore,Wloc,Numot,ScoreDifference,HighScoringGame
0,1985,20,1228,81,1328,64,N,0,17,No
1,1985,25,1106,77,1354,70,H,0,7,No
2,1985,25,1112,63,1223,56,H,0,7,No
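
One thing that's easy to trip over: iloc slices by position and excludes the end, while loc slices by label and includes it. Here's a small sketch on made-up data (both tables and the 10 to 14 row labels are invented for illustration).

```python
import pandas as pd

scores = [81, 77, 63, 70, 86]

# Default index: the labels 0..4 happen to match the positions
games = pd.DataFrame({'Wscore': scores})
by_position = games.iloc[0:3]   # positions 0, 1, 2 (end excluded)
by_label = games.loc[0:3]       # labels 0, 1, 2, 3 (end included)

# Custom index: the labels no longer match the positions
relabeled = pd.DataFrame({'Wscore': scores}, index=[10, 11, 12, 13, 14])
bracket_slice = relabeled[0:3]        # an integer slice in brackets is positional
label_slice = relabeled.loc[10:12]    # the same three rows, selected by label

print(len(by_position), len(by_label))
print(bracket_slice.index.tolist(), label_slice.index.tolist())
```

So df[0:3] and df.iloc[0:3, :] match, but df.loc[0:3] on our dataframe would return four rows.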


# Data Cleaning

A big part of doing well in Kaggle competitions is data cleaning. A lot of the time, the CSV file you're given (the Titanic dataset is a good example) will have a lot of missing values, which you have to identify. The **isnull** function below flags every missing value in the dataframe, and chaining **sum()** onto it totals them up for each column. In this case, we have a pretty clean dataset.

In [174]:
df.isnull().sum()

Season             0
Daynum             0
Wteam              0
Wscore             0
Lteam              0
Lscore             0
Wloc               0
Numot              0
ScoreDifference    0
HighScoringGame    0
dtype: int64

If you do end up having missing values in your datasets, be sure to get familiar with these two functions.
* **dropna()** - This function allows you to drop all (or some) of the rows that have missing values.
* **fillna()** - This function allows you to replace the missing values with the value that you pass in.
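
Since our dataset has nothing missing, here's a quick sketch of both functions on a made-up table with a few holes in it. The values and the 'unknown' placeholder are invented for illustration.

```python
import numpy as np
import pandas as pd

# Made-up table with one missing value in each column
games = pd.DataFrame({
    'Wscore': [81, np.nan, 63],
    'Lscore': [64, 70, np.nan],
    'Wloc': ['N', None, 'H'],
})

missing_per_column = games.isnull().sum()

# Drop every row that contains at least one missing value
complete_rows = games.dropna()

# Drop only the rows where Wscore is missing
has_wscore = games.dropna(subset=['Wscore'])

# Fill column by column: the dict maps column name -> replacement value
filled = games.fillna({'Wscore': 0, 'Lscore': 0, 'Wloc': 'unknown'})

print(missing_per_column)
print(complete_rows)
print(filled)
```

Whether to drop or fill (and what to fill with) depends on the problem; filling scores with 0 here is only for demonstration.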

# Other Useful Functions

* **drop()** - This function removes the columns or rows that you pass in (you also have to specify the axis).
* **agg()** - The aggregate function lets you compute summary statistics about each group.
* **apply()** - Lets you apply a function along the rows or columns of a DataFrame, or to each element of a Series.
* **get_dummies()** - Helpful for turning categorical data into one-hot vectors.
* **drop_duplicates()** - Lets you remove duplicate rows.
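
Here's a rough sketch that touches each of these once. The five-row table is made up for illustration (its last row is a deliberate duplicate) and just reuses our dataset's column names.

```python
import pandas as pd

# Made-up table; the last row deliberately duplicates the fourth
games = pd.DataFrame({
    'Season': [1985, 1985, 1986, 1986, 1986],
    'Wscore': [81, 77, 63, 70, 70],
    'Lscore': [64, 70, 56, 54, 54],
    'Wloc': ['N', 'H', 'H', 'N', 'N'],
})

# drop_duplicates(): remove identical rows
deduped = games.drop_duplicates()

# drop(): remove a column (axis=1 means columns)
no_loc = deduped.drop('Wloc', axis=1)

# agg(): several summary statistics per group
per_season = deduped.groupby('Season')['Wscore'].agg(['mean', 'max'])

# apply(): run a function on every row (axis=1)
margin = deduped.apply(lambda row: row['Wscore'] - row['Lscore'], axis=1)

# get_dummies(): one indicator column per category
dummies = pd.get_dummies(deduped['Wloc'], prefix='Wloc')

print(per_season)
print(margin)
print(dummies)
```

The apply() call is only there to show the mechanics; for simple arithmetic like this, the vectorized deduped['Wscore'] - deduped['Lscore'] gives the same values and is usually faster.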

# Lots of Other Great Resources

Pandas has been around for a while, and there are a lot of other good resources if you're still interested in getting the most out of this library.
* http://pandas.pydata.org/pandas-docs/stable/10min.html
* https://www.datacamp.com/community/tutorials/pandas-tutorial-dataframe-python
* http://www.gregreda.com/2013/10/26/intro-to-pandas-data-structures/
* https://www.dataquest.io/blog/pandas-python-tutorial/
* https://drive.google.com/file/d/0ByIrJAE4KMTtTUtiVExiUGVkRkE/view
* https://www.youtube.com/playlist?list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y