### INFO 3401

Sample exploratory data analysis for module 2.1

In [1]:
import numpy as np
import pandas as pd
import sqlite3
import os

## Step 1

I downloaded the [Lahman](https://www.kaggle.com/datasets?fileType=sqlite&sizeEnd=50%2CMB) database from Kaggle. This is a database of baseball facts. I picked this dataset because I know many 3401 students like sports analytics, and I wanted to choose data that is interesting to at least some of the class.

In [2]:
# here is how you connect to the Lahman database and describe its tables

import sqlite3

con = sqlite3.connect("lahmansbaseballdb.sqlite") # pass a string pointing to the .sqlite file on your machine

# get the db name
db_name = pd.read_sql("PRAGMA database_list;", con)["name"][0]

In [3]:
# The name of the main schema in your database might be different.
# Let's list the tables in the database
lahmans = pd.read_sql("SELECT * FROM {}.sqlite_master WHERE type='table';".format(db_name), con=con)

lahmans

Unnamed: 0,type,name,tbl_name,rootpage,sql
0,table,all_star,all_star,2,"CREATE TABLE all_star (\n player_id TEXT,\n..."
1,table,appearances,appearances,217,"CREATE TABLE appearances (\n year INTEGER,\..."
2,table,manager_award,manager_award,5278,CREATE TABLE manager_award (\n player_id TE...
3,table,player_award,player_award,5288,CREATE TABLE player_award (\n player_id TEX...
4,table,manager_award_vote,manager_award_vote,5560,CREATE TABLE manager_award_vote (\n award_i...
5,table,player_award_vote,player_award_vote,5580,CREATE TABLE player_award_vote (\n award_id...
6,table,batting,batting,5817,"CREATE TABLE batting (\n player_id TEXT,\n ..."
7,table,batting_postseason,batting_postseason,11278,CREATE TABLE batting_postseason (\n year IN...
8,table,player_college,player_college,11882,CREATE TABLE player_college (\n player_id T...
9,table,fielding,fielding,12362,"CREATE TABLE fielding (\n player_id TEXT,\n..."


## Step 2

I just printed out the structure of the database and am curious about a few tables. Let's take a look at the hall of fame table. It seems like this table shows players in the hall of fame.

In [4]:
pd.read_sql_query('select * from hall_of_fame limit 5',con)

Unnamed: 0,player_id,yearid,votedby,ballots,needed,votes,inducted,category,needed_note
0,cobbty01,1936,BBWAA,226,170,222,Y,Player,
1,ruthba01,1936,BBWAA,226,170,215,Y,Player,
2,wagneho01,1936,BBWAA,226,170,215,Y,Player,
3,mathech01,1936,BBWAA,226,170,205,Y,Player,
4,johnswa01,1936,BBWAA,226,170,189,Y,Player,


I don't know which player IDs refer to which players, so it looks like to find anything interesting from the hall of fame table I need the players table also (to find the names linked to IDs).

In [5]:
pd.read_sql_query('select * from player limit 5', con)

Unnamed: 0,player_id,birth_year,birth_month,birth_day,birth_country,birth_state,birth_city,death_year,death_month,death_day,...,name_last,name_given,weight,height,bats,throws,debut,final_game,retro_id,bbref_id
0,aardsda01,1981,12,27,USA,CO,Denver,,,,...,Aardsma,David Allan,220,75,R,R,2004-04-06,2015-08-23,aardd001,aardsda01
1,aaronha01,1934,2,5,USA,AL,Mobile,,,,...,Aaron,Henry Louis,180,72,R,R,1954-04-13,1976-10-03,aaroh101,aaronha01
2,aaronto01,1939,8,5,USA,AL,Mobile,1984.0,8.0,16.0,...,Aaron,Tommie Lee,190,75,R,R,1962-04-10,1971-09-26,aarot101,aaronto01
3,aasedo01,1954,9,8,USA,CA,Orange,,,,...,Aase,Donald William,190,75,R,R,1977-07-26,1990-10-03,aased001,aasedo01
4,abadan01,1972,8,25,USA,FL,Palm Beach,,,,...,Abad,Fausto Andres,184,73,L,L,2001-09-10,2006-04-13,abada001,abadan01
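
Linking IDs to names is a join on `player_id`. Here is a minimal sketch of that join, run against a tiny in-memory database so it is self-contained. The two toy tables only carry a few of the real columns, and the connection name `con_toy` is just for illustration.

```python
import sqlite3
import pandas as pd

# Toy stand-ins for the player and hall_of_fame tables (a few columns only)
con_toy = sqlite3.connect(":memory:")
con_toy.executescript("""
CREATE TABLE player (player_id TEXT, name_first TEXT, name_last TEXT);
CREATE TABLE hall_of_fame (player_id TEXT, yearid INTEGER, inducted TEXT);
INSERT INTO player VALUES ('cobbty01', 'Ty', 'Cobb'), ('ruthba01', 'Babe', 'Ruth');
INSERT INTO hall_of_fame VALUES ('cobbty01', 1936, 'Y'), ('ruthba01', 1936, 'Y');
""")

# Attach names to each hall of fame row by matching on player_id
query = """
SELECT h.player_id, p.name_first, p.name_last, h.yearid, h.inducted
FROM hall_of_fame AS h
INNER JOIN player AS p ON h.player_id = p.player_id
ORDER BY h.player_id
"""
named = pd.read_sql_query(query, con_toy)
```

Against the real database, the same query should work with `con` in place of `con_toy`, assuming your copy uses the same column names.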


Looking again at the list of tables in the database, I see there is also an all star table. Let's look at that.

In [6]:
pd.read_sql_query('select * from all_star limit 5', con)

Unnamed: 0,player_id,year,game_num,game_id,team_id,league_id,gp,starting_pos
0,gomezle01,1933,0,ALS193307060,NYA,AL,1,1
1,ferreri01,1933,0,ALS193307060,BOS,AL,1,2
2,gehrilo01,1933,0,ALS193307060,NYA,AL,1,3
3,gehrich01,1933,0,ALS193307060,DET,AL,1,4
4,dykesji01,1933,0,ALS193307060,CHA,AL,1,5


Now I am starting to think of an interesting question. Are there players that were often all stars but who were not elected to the hall of fame? Maybe these are **forgotten stars**! They should have made it to the hall of fame, but they were robbed! To investigate, I need to find out how many all star games each player played in. This will require joins and aggregation. 

But before I am ready to proceed, I need to understand the data a little better. It is **very** common and **very** important to take time to understand data before drawing conclusions. In this case, it seems like the all_star table has a game_num field. What does that field represent? I am not sure whether that field shows the number of all star games that a player played in (over time) or something else. To investigate, I will have to check the documentation or field list for the database in the [readme](https://www.kaggle.com/seanlahman/the-history-of-baseball). Your database might have slightly different documentation conventions. Learning how to read and make sense of documentation (including data documentation) is an important skill.

In my case, it seems like the game_num field refers to the number of the all star game the player played in. My guess is that this means there must have been more than one all star game in some seasons. I can double check that belief by seeing all of the possible values of the `game_num` field with the following query. Looking at the output, it seems like `game_num` is never more than 2, so my hypothesis about the value of the field makes sense. Again, I am thinking through the documentation and checking my own assumptions about the data using code here.

In [7]:
pd.read_sql_query('select distinct game_num from all_star', con)

Unnamed: 0,game_num
0,0
1,2
2,1
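
A slightly stronger check of the same hypothesis is to count distinct games per season. Here is a minimal sketch on a toy table; the rows are copied from a few of Hank Aaron's records in this database, and the table keeps only the columns the check needs.

```python
import sqlite3
import pandas as pd

con_toy = sqlite3.connect(":memory:")
con_toy.executescript("""
CREATE TABLE all_star (player_id TEXT, year INTEGER, game_num INTEGER, game_id TEXT);
INSERT INTO all_star VALUES
  ('aaronha01', 1958, 0, 'ALS195807080'),
  ('aaronha01', 1959, 1, 'NLS195907070'),
  ('aaronha01', 1959, 2, 'NLS195908030'),
  ('aaronha01', 1960, 1, 'ALS196007110'),
  ('aaronha01', 1960, 2, 'ALS196007130');
""")

# If game_num numbers the games within a season, then seasons with
# game_num 1 and 2 should have two distinct game_id values
per_year = pd.read_sql_query("""
SELECT year,
       COUNT(DISTINCT game_id) AS n_games,
       MAX(game_num) AS max_game_num
FROM all_star
GROUP BY year
ORDER BY year
""", con_toy)
```

In this toy data, the seasons where `max_game_num` is 2 are exactly the seasons with two distinct games, which is what I would expect if my reading of the field is right.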


At the end of step 2, I will write my question in bold text to make it clear what I am trying to find out. Please do the same!

**Question: are there players that were often all stars but who were not elected to the hall of fame?**

## Step 3

In [8]:
# To answer, I make a dataframe called all_stars that joins the players table and all_star table
all_stars = pd.read_sql_query('select all_star.player_id from all_star inner join player on all_star.player_id = player.player_id', con)

# I group by player_id using pandas
g = all_stars.groupby('player_id')

# I use the .size command to learn the size (i.e. # rows) of each group, and then reset the indexes.
# I find it is often helpful to reset the pandas indexes to keep things simple
all_star_counts = g.size().to_frame().reset_index()

# I rename a column created by my groupby/size operations to something more meaningful
all_star_counts = all_star_counts.rename(columns={0: "N_all_star_games"})

# Finally, I sort the values in the table 
all_star_counts.sort_values(by=['N_all_star_games'], ascending=False)

# I should emphasize that I did not just type out these commands effortlessly. As I went, I examined intermediate tables
# to check my results, and consulted the documentation to understand what was happening. You should plan on
# doing the same.

Unnamed: 0,player_id,N_all_star_games
0,aaronha01,25
1125,musiast01,24
1011,mayswi01,24
968,mantlmi01,20
1321,ripkeca01,19
...,...,...
1157,odayda01,1
1158,odeake01,1
1161,odonojo01,1
1162,odoulle01,1
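
The groupby and size steps can also be done in SQL with `GROUP BY` and `COUNT(*)`. Here is a minimal sketch that runs both versions on a toy table and checks that they agree. The toy rows are made up for illustration; only the player IDs are borrowed from the output above.

```python
import sqlite3
import pandas as pd

con_toy = sqlite3.connect(":memory:")
con_toy.executescript("""
CREATE TABLE all_star (player_id TEXT, year INTEGER);
INSERT INTO all_star VALUES
  ('aaronha01', 1955), ('aaronha01', 1956), ('aaronha01', 1957),
  ('mayswi01', 1954), ('mayswi01', 1955),
  ('odoulle01', 1933);
""")

# SQL version: one row per player with the number of all_star rows
sql_counts = pd.read_sql_query("""
SELECT player_id, COUNT(*) AS N_all_star_games
FROM all_star
GROUP BY player_id
ORDER BY N_all_star_games DESC, player_id
""", con_toy)

# pandas version: same result, built from the raw rows
rows = pd.read_sql_query("SELECT player_id FROM all_star", con_toy)
pandas_counts = (
    rows.groupby("player_id")
    .size()
    .reset_index(name="N_all_star_games")  # names the count column directly
    .sort_values(["N_all_star_games", "player_id"], ascending=[False, True])
    .reset_index(drop=True)
)
```

Passing `name=` to `reset_index` is a small shortcut that avoids the separate rename step. Either approach is fine for this class.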


Now I am going to pause to check and consider my results. This table says the player aaronha01 played in 25 all star games. Is that right? Well, who is `aaronha01`? Again, I use code to check my intuitions about this data.

In [9]:
pd.read_sql_query('select * from player where player_id = "aaronha01"', con)

Unnamed: 0,player_id,birth_year,birth_month,birth_day,birth_country,birth_state,birth_city,death_year,death_month,death_day,...,name_last,name_given,weight,height,bats,throws,debut,final_game,retro_id,bbref_id
0,aaronha01,1934,2,5,USA,AL,Mobile,,,,...,Aaron,Henry Louis,180,72,R,R,1954-04-13,1976-10-03,aaroh101,aaronha01
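
The lookup above writes the player ID directly into the SQL text, inside double quotes. SQLite generally accepts that, but standard SQL reserves double quotes for identifiers, and pasting values into query strings gets error-prone as queries grow. Here is a minimal sketch of the same lookup using a `?` placeholder and the `params` argument of `pd.read_sql_query`. The toy `player` table has only three of the real columns.

```python
import sqlite3
import pandas as pd

con_toy = sqlite3.connect(":memory:")
con_toy.executescript("""
CREATE TABLE player (player_id TEXT, name_last TEXT, name_given TEXT);
INSERT INTO player VALUES
  ('aaronha01', 'Aaron', 'Henry Louis'),
  ('aaronto01', 'Aaron', 'Tommie Lee');
""")

# The ? is filled in by the database driver, so no quoting is needed
lookup = pd.read_sql_query(
    "SELECT * FROM player WHERE player_id = ?",
    con_toy,
    params=("aaronha01",),
)
```

The `params` value is a tuple, so a single ID needs the trailing comma.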


According to the table I just made, Henry Louis Aaron (aka Hank Aaron) played in 25 all star games. Is that right? Actually, I see from this [Wikipedia](https://en.wikipedia.org/wiki/Hank_Aaron) page that Aaron was credited with 24 games. What might be going on?


In [10]:
pd.read_sql_query('select * from all_star where player_id="aaronha01"', con)

Unnamed: 0,player_id,year,game_num,game_id,team_id,league_id,gp,starting_pos
0,aaronha01,1955,0,NLS195507120,ML1,NL,1,
1,aaronha01,1956,0,ALS195607100,ML1,NL,1,
2,aaronha01,1957,0,NLS195707090,ML1,NL,1,9.0
3,aaronha01,1958,0,ALS195807080,ML1,NL,1,9.0
4,aaronha01,1959,1,NLS195907070,ML1,NL,1,9.0
5,aaronha01,1959,2,NLS195908030,ML1,NL,1,9.0
6,aaronha01,1960,1,ALS196007110,ML1,NL,1,9.0
7,aaronha01,1960,2,ALS196007130,ML1,NL,1,9.0
8,aaronha01,1961,1,NLS196107110,ML1,NL,1,
9,aaronha01,1961,2,ALS196107310,ML1,NL,1,


I see that the gp field for one row is 0, which looks suspicious. If I check the documentation, I see that this field is 1 if the player played in the game. So I am going to amend my JOIN query to keep only the rows where `gp=1` (see below). This sort of validation is an important step when doing data analysis.

In [11]:
all_stars = pd.read_sql_query('select all_star.player_id from all_star inner join player on all_star.player_id = player.player_id where gp=1', con)
g = all_stars.groupby('player_id')
all_star_counts = g.size().to_frame().reset_index()
all_star_counts = all_star_counts.rename(columns={0: "N_all_star_games"})
all_star_counts.sort_values(by=['N_all_star_games'], ascending=False) 

# now all_star_counts shows that Hank Aaron played in 24 all star games, which matches what I know from Wikipedia

Unnamed: 0,player_id,N_all_star_games
0,aaronha01,24
954,musiast01,24
861,mayswi01,24
1430,willite01,18
1121,ripkeca01,18
...,...,...
702,kempst01,1
704,kendrho01,1
707,keougma02,1
709,kerrbu01,1
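
To see how much the `gp=1` filter changes the counts, it can help to put both numbers side by side. Here is a minimal sketch on a toy table that counts all rows per player and, separately, only the rows where the player actually played. The player IDs `player_a` and `player_b` are hypothetical.

```python
import sqlite3
import pandas as pd

con_toy = sqlite3.connect(":memory:")
con_toy.executescript("""
CREATE TABLE all_star (player_id TEXT, year INTEGER, gp INTEGER);
INSERT INTO all_star VALUES
  ('player_a', 2001, 1), ('player_a', 2002, 0), ('player_a', 2003, 1),
  ('player_b', 2001, 1), ('player_b', 2002, 1);
""")

# n_selected counts every row; n_played counts only rows with gp = 1
compare = pd.read_sql_query("""
SELECT player_id,
       COUNT(*) AS n_selected,
       SUM(CASE WHEN gp = 1 THEN 1 ELSE 0 END) AS n_played
FROM all_star
GROUP BY player_id
ORDER BY player_id
""", con_toy)
```

Players where the two columns differ are the ones affected by the filter, which makes them good candidates for a spot check against an outside source.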


In [12]:
# Now let's merge in the hall_of_fame data
hall = pd.read_sql_query('select  * from hall_of_fame', con)

# I will use pandas to do the join. You could also do this in SQL. The API is similar. 
# There will be important differences in performance between SQL and pandas that you can ignore in 3401 but might be important in other contexts
merged = pd.merge(all_star_counts, hall, how='left', on='player_id')

all_stars_vs_hall = merged[['player_id', "N_all_star_games", "inducted"]]

all_stars_vs_hall = all_stars_vs_hall.sort_values(by=['N_all_star_games'], ascending=False)

# because I did a left join, if inducted is a NaN this means the player does not show up in the hall of fame table
# Thus the player never made it to the hall of fame and we can set the value to N. I think only the inducted column contains NaN
all_stars_vs_hall = all_stars_vs_hall.fillna(value="N")

# We're getting close. Now we have a table showing the number of all star games each player played
# We also see if the player was inducted into the hall of fame
all_stars_vs_hall 

Unnamed: 0,player_id,N_all_star_games,inducted
0,aaronha01,24,Y
1843,mayswi01,24,Y
2077,musiast01,24,Y
2467,robinbr01,18,Y
2439,ripkeca01,18,Y
...,...,...,...
1017,gordosi01,1,N
2585,saundjo01,1,N
1001,gordode01,1,N
2587,scheiri01,1,N
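
Calling `fillna` on the whole table relies on my guess that only the inducted column has missing values. Here is a minimal sketch of a more cautious version: `pd.merge` can record which rows found a match through `indicator=True`, and `fillna` can be applied to one column only. The player IDs `p1`, `p2`, `p3` and the column name `in_hall_table` are made up for illustration.

```python
import pandas as pd

counts = pd.DataFrame({
    "player_id": ["p1", "p2", "p3"],
    "N_all_star_games": [5, 3, 1],
})
hall_toy = pd.DataFrame({
    "player_id": ["p1", "p2"],
    "inducted": ["Y", "N"],
})

# indicator=True adds a _merge column: 'both' or 'left_only' for a left join
merged_toy = pd.merge(counts, hall_toy, how="left", on="player_id", indicator=True)

# Remember which players had any row in the hall of fame table
merged_toy["in_hall_table"] = merged_toy["_merge"] == "both"

# Fill missing values in the inducted column only, leaving other columns alone
merged_toy["inducted"] = merged_toy["inducted"].fillna("N")
```

With this version, a player who was voted on but not inducted (`p2`) can be told apart from a player who never appeared on a ballot (`p3`), even though both end up with `inducted` equal to N.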


In [13]:
# Now we can make a table that shows the top 10 players not inducted into the hall of fame, ranked by # all star games

most_robbed = all_stars_vs_hall[all_stars_vs_hall["inducted"] == "N"].drop_duplicates().head(10)

most_robbed

Unnamed: 0,player_id,N_all_star_games,inducted
2503,rosepe01,16,N
158,berrayo01,15,N
2475,rodriiv01,14,N
885,foxne01,13,N
209,bondsba01,13,N
1377,jeterde01,13,N
26,alomaro01,12,N
1493,killeha01,11,N
687,dimagjo01,11,N
2199,ottme01,11,N


In [14]:
players = pd.read_sql_query('select * from player', con)

most_robbed_plus = pd.merge(most_robbed, players, how='inner', on='player_id')

most_robbed_plus = most_robbed_plus[['name_last', 'name_first', 'inducted', 'N_all_star_games']]

most_robbed_plus

Unnamed: 0,name_last,name_first,inducted,N_all_star_games
0,Rose,Pete,N,16
1,Berra,Yogi,N,15
2,Rodriguez,Ivan,N,14
3,Fox,Nellie,N,13
4,Bonds,Barry,N,13
5,Jeter,Derek,N,13
6,Alomar,Roberto,N,12
7,Killebrew,Harmon,N,11
8,DiMaggio,Joe,N,11
9,Ott,Mel,N,11


This looks pretty good. I see that the top player who did not make the hall is Pete Rose, who I know was famously banned from the sport for gambling. But I also see that Joe DiMaggio did not make the hall of fame. Based on my knowledge of baseball, that seems a bit off. I check DiMaggio's [Wikipedia](https://en.wikipedia.org/wiki/Joe_DiMaggio) page and see both that Joltin' Joe was the son of a San Francisco fisherman (fun fact) and that he was inducted into the hall of fame in 1955 (troubling fact). Something is wrong with my analysis because Joe DiMaggio should not show up in the `most_robbed_plus` table. Again, I use my own domain knowledge to check conclusions from my data analysis. If I use the query below, I see that DiMaggio went through four votes before he was elected into the hall, so the hall_of_fame table has one row per vote rather than one row per player. My code above assumes that if the value of the "inducted" field is N in any row, the player was *never* inducted to the hall. To clean up, I am going to need to requery the hall_of_fame table, keeping only the rows where `inducted="Y"`. I include my amended code below.

In [15]:
hall = pd.read_sql_query('select  * from hall_of_fame where player_id="dimagjo01"', con)
hall

Unnamed: 0,player_id,yearid,votedby,ballots,needed,votes,inducted,category,needed_note
0,dimagjo01,1945,BBWAA,247,186,1,N,Player,
1,dimagjo01,1953,BBWAA,264,198,117,N,Player,
2,dimagjo01,1954,BBWAA,252,189,175,N,Player,
3,dimagjo01,1955,BBWAA,251,189,223,Y,Player,
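
Because the hall_of_fame table has one row per vote, another way to fix the problem is to collapse it to one row per player before merging. Here is a minimal sketch that asks, for each player, whether *any* of their rows has `inducted` equal to Y. The DiMaggio rows are copied from the output above; `toyplay01` is a hypothetical player who was voted on twice and never inducted.

```python
import pandas as pd

hall_toy = pd.DataFrame({
    "player_id": ["dimagjo01"] * 4 + ["toyplay01"] * 2,
    "yearid": [1945, 1953, 1954, 1955, 1990, 1991],
    "inducted": ["N", "N", "N", "Y", "N", "N"],
})

# One row per player: True if the player was inducted in any year
ever_inducted = (
    hall_toy["inducted"].eq("Y")
    .groupby(hall_toy["player_id"])
    .any()
    .rename("ever_inducted")
    .reset_index()
)
```

This table can then be left-joined to the all star counts without creating duplicate rows per player. It is an alternative to filtering on inducted in SQL, not a replacement for it.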


In [24]:
hall = pd.read_sql_query('select  * from hall_of_fame where inducted="Y"', con)
merged = pd.merge(all_star_counts, hall, how='left', on='player_id')
most_robbed = merged[['player_id', "N_all_star_games", "inducted"]]
most_robbed = most_robbed.fillna("N").sort_values(by=['N_all_star_games'], ascending=False)
most_robbed = most_robbed[most_robbed["inducted"] == "N"].drop_duplicates().head(10)
players = pd.read_sql_query('select * from player', con)
most_robbed = pd.merge(most_robbed, players, how='inner', on='player_id')[["player_id", "N_all_star_games", "inducted", "name_last", "name_first"]]
most_robbed

Unnamed: 0,player_id,N_all_star_games,inducted,name_last,name_first
0,rosepe01,16,N,Rose,Pete
1,rodriiv01,14,N,Rodriguez,Ivan
2,bondsba01,13,N,Bonds,Barry
3,jeterde01,13,N,Jeter,Derek
4,rodrial01,11,N,Rodriguez,Alex
5,clemero02,10,N,Clemens,Roger
6,suzukic01,10,N,Suzuki,Ichiro
7,garvest01,10,N,Garvey,Steve
8,boyerke01,10,N,Boyer,Ken
9,guerrvl01,9,N,Guerrero,Vladimir


This table seems pretty good. But if I look up Ivan Rodriguez on Wikipedia, I see that he was inducted into the hall of fame in 2017. Something seems wrong. Again, I need to check my results. If I query the database, I see that Rodriguez has no rows in the hall_of_fame table and that the maximum `yearid` is 2016, so Rodriguez was inducted after the time range of the dataset. The same is true for Vladimir Guerrero, who was elected in 2018. Barry Bonds and Roger Clemens are each controversial picks because of steroid use. So overall this table seems pretty reasonable.

In [21]:
pd.read_sql_query('select  * from hall_of_fame where player_id="rodriiv01"', con)

Unnamed: 0,player_id,yearid,votedby,ballots,needed,votes,inducted,category,needed_note


In [25]:
hall = pd.read_sql_query('select max(yearid) from hall_of_fame', con)
hall

Unnamed: 0,max(yearid)
0,2016


## Step 4

I conducted an analysis of the Lahman baseball database, examining which players who were frequent all stars were not inducted into the Baseball Hall of Fame. Looking over the top 10 players (by all star appearances) who were *not* inducted into the Hall, I found a few things.

1. Some players from this list (e.g. Barry Bonds or Roger Clemens) are excluded from the Hall of Fame because of steroid use. Similarly, Pete Rose is excluded because of gambling.
2. Some players from this list have recently been inducted into the Hall of Fame (e.g. Ivan Rodriguez or Vladimir Guerrero). So number of all star games seems like a decent predictor of future induction into the Hall of Fame.
3. Some players from this list (e.g. Steve Garvey) seem like good candidates for the hall of fame.

In the future, I might extend this analysis by looking at the distribution of All Star Game appearances among players in the Hall of Fame. I might want to know what fraction of players with $K$ All Star appearances make the Hall. I present my table below.

In [28]:
most_robbed

Unnamed: 0,player_id,N_all_star_games,inducted,name_last,name_first
0,rosepe01,16,N,Rose,Pete
1,rodriiv01,14,N,Rodriguez,Ivan
2,bondsba01,13,N,Bonds,Barry
3,jeterde01,13,N,Jeter,Derek
4,rodrial01,11,N,Rodriguez,Alex
5,clemero02,10,N,Clemens,Roger
6,suzukic01,10,N,Suzuki,Ichiro
7,garvest01,10,N,Garvey,Steve
8,boyerke01,10,N,Boyer,Ken
9,guerrvl01,9,N,Guerrero,Vladimir
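
The extension mentioned in Step 4 (what fraction of players with $K$ All Star appearances make the Hall?) comes down to a groupby on the number of games: for each $K$, divide the number of inducted players with $K$ appearances by the number of all players with $K$ appearances. Here is a minimal sketch on made-up data. It assumes a table with exactly one row per player, and the values below are invented for illustration, not results from the database.

```python
import pandas as pd

# One row per player: toy values only
per_player = pd.DataFrame({
    "N_all_star_games": [1, 1, 1, 2, 2, 3],
    "inducted": ["N", "N", "Y", "N", "Y", "Y"],
})

# The mean of a True/False column is the fraction of True values
fraction_by_k = (
    per_player.assign(in_hall=per_player["inducted"].eq("Y"))
    .groupby("N_all_star_games")
    .agg(
        fraction_inducted=("in_hall", "mean"),
        n_players=("in_hall", "size"),
    )
    .reset_index()
)
```

Keeping `n_players` next to the fraction matters because large values of $K$ will have very few players, so those fractions will be noisy.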
