![Moneyball](https://proxy.duckduckgo.com/iu/?u=http%3A%2F%2F2.bp.blogspot.com%2F-L8BZ-HVVizs%2FToIogdMGV-I%2FAAAAAAAAAsk%2FeVODynuBxZE%2Fs1600%2Fmoneyball.png&f=1&nofb=1)

# Moneyball homework

## Rules of Baseball

You don't need to know much about baseball to complete this exercise. If you're totally unfamiliar with baseball, check out this useful [explanatory video](https://www.youtube.com/watch?v=0bKkGeROiPA)!

### [Background](https://www.youtube.com/watch?v=yGf6LNWY9AI)
*Source: Wikipedia*

### The 2002 Oakland A's

The Oakland Athletics' 2002 season was the team's 35th in Oakland, California. It was also the 102nd season in franchise history. The Athletics finished first in the American League West with a record of 103-59.

The Athletics' 2002 campaign ranks among the most famous in franchise history. Following the 2001 season, Oakland saw the departure of three key players (the lost boys). Billy Beane, the team's general manager, responded with a series of under-the-radar free agent signings. The new-look Athletics, despite a comparative lack of star power, surprised the baseball world by besting the 2001 team's regular season record. The team is most famous, however, for winning 20 consecutive games between August 13 and September 4, 2002.[1] The Athletics' season was the subject of Michael Lewis' 2003 book Moneyball: The Art of Winning an Unfair Game (Lewis was given the opportunity to follow the team around throughout that season).

This project is based on the book written by Michael Lewis (later turned into a movie).

### Moneyball Book

The central premise of the book Moneyball is that the collective wisdom of baseball insiders (including players, managers, coaches, scouts, and the front office) over the past century is subjective and often flawed. Statistics such as stolen bases, runs batted in, and batting average, typically used to gauge players, are relics of a 19th-century view of the game and the statistics available at that time. The book argues that the Oakland A's front office took advantage of more analytical gauges of player performance to field a team that could better compete against richer competitors in Major League Baseball (MLB).

Rigorous statistical analysis had demonstrated that on-base percentage and slugging percentage are better indicators of offensive success, and the A's became convinced that these qualities were cheaper to obtain on the open market than more historically valued qualities such as speed and contact. These observations often flew in the face of conventional baseball wisdom and the beliefs of many baseball scouts and executives.

By re-evaluating the strategies that produce wins on the field, the 2002 Athletics, with approximately US$44 million in salary, were competitive with larger market teams such as the New York Yankees, who spent over US$125 million in payroll that same season.

Because of the team's smaller revenues, Oakland is forced to find players undervalued by the market, and their system for finding value in undervalued players has proven itself thus far. This approach brought the A's to the playoffs in 2002 and 2003.

In this project we'll work with real data, with the goal of finding replacement players for the ones lost at the start of the off-season. During the 2001–02 offseason, the team lost three key free agents to larger market teams: 2000 AL MVP Jason Giambi to the New York Yankees, outfielder Johnny Damon to the Boston Red Sox, and closer Jason Isringhausen to the St. Louis Cardinals.

The main goal of this project is for you to feel comfortable working with Python on real data to try and derive actionable insights!
Let's get started!

***Follow the steps outlined in bold below using your new Python skills and help the Oakland A's recruit under-valued players!***

![Gameplan](https://proxy.duckduckgo.com/iu/?u=https%3A%2F%2Ftse1.mm.bing.net%2Fth%3Fid%3DOIP.RI7hlTMbnkPlzADQTtMeqwHaE8%26pid%3DApi&f=1)

### Data

We'll be using data from Sean Lahman's website, a very useful source for baseball statistics. The documentation for the csv files is located in the readme2013.txt file. You may need to reference this to understand what the acronyms stand for.

**Use Pandas to open the Batting file and assign it to a dataframe called batting**

In [1]:
# Define the url the data is located at
batting_url = 'https://rotterdamai001.blob.core.windows.net/python/mlb/core/Batting.csv'

In [2]:
# Import pandas for data processing
import pandas as pd
# Save batting dataframe
batting = pd.read_csv(batting_url)

**Review the first 5 rows of the batting file**

In [3]:
# Output the first five rows in the console
batting.head()

Unnamed: 0,playerID,yearID,stint,teamID,lgID,G,AB,R,H,2B,...,RBI,SB,CS,BB,SO,IBB,HBP,SH,SF,GIDP
0,abercda01,1871,1,TRO,,1,4,0,0,0,...,0.0,0.0,0.0,0,0.0,,,,,0.0
1,addybo01,1871,1,RC1,,25,118,30,32,6,...,13.0,8.0,1.0,4,0.0,,,,,0.0
2,allisar01,1871,1,CL1,,29,137,28,40,4,...,19.0,3.0,1.0,2,5.0,,,,,1.0
3,allisdo01,1871,1,WS3,,27,133,28,44,10,...,27.0,1.0,1.0,0,2.0,,,,,0.0
4,ansonca01,1871,1,RC1,,25,120,29,39,11,...,16.0,6.0,2.0,2,1.0,,,,,0.0


**Review the structure of batting. Column names that start with a number (such as 2B and 3B) cannot be called with dot notation in Python, so we rename them with an 'X' in front! You'll need to know this to call those columns!**

In [4]:
# Rename columns starting with a number and place 'X' in front of them
batting.rename({'2B':'X2B', '3B':'X3B'}, inplace=True, axis=1)
# Output the structure of batting using .info()
batting.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 105861 entries, 0 to 105860
Data columns (total 22 columns):
playerID    105861 non-null object
yearID      105861 non-null int64
stint       105861 non-null int64
teamID      105861 non-null object
lgID        105123 non-null object
G           105861 non-null int64
AB          105861 non-null int64
R           105861 non-null int64
H           105861 non-null int64
X2B         105861 non-null int64
X3B         105861 non-null int64
HR          105861 non-null int64
RBI         105105 non-null float64
SB          103493 non-null float64
CS          82320 non-null float64
BB          105861 non-null int64
SO          103761 non-null float64
IBB         69210 non-null float64
HBP         103044 non-null float64
SH          99792 non-null float64
SF          69757 non-null float64
GIDP        80420 non-null float64
dtypes: float64(9), int64(10), object(3)
memory usage: 17.8+ MB


**Call the head of the doubles (X2B) column**

In [5]:
# Return the first 5 rows of the X2B column
batting.X2B.head()

0     0
1     6
2     4
3    10
4    11
Name: X2B, dtype: int64
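
Why the rename matters can be sketched on a toy frame. The values below are made up for illustration; only the column names mirror the batting data. Bracket access works for any column name, while attribute access needs a name that is a valid Python identifier.

```python
import pandas as pd

# Toy frame with made-up values; only the column names mirror the batting data
toy = pd.DataFrame({"playerID": ["a01", "b01"], "2B": [6, 4], "HR": [1, 0]})

# Bracket access works for any column name, including ones starting with a digit
doubles = toy["2B"]

# Attribute access (toy.2B) would be a SyntaxError, so rename the column first
renamed = toy.rename(columns={"2B": "X2B"})

print(doubles.tolist())
print(renamed.X2B.tolist())
```

If you prefer to keep the original column names, bracket access such as `batting['2B']` is an alternative to renaming.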

## Feature Engineering

We need to add three more statistics that were used in Moneyball! These are:

* [Batting Average](https://en.wikipedia.org/wiki/Batting_average)
* [On Base Percentage](https://en.wikipedia.org/wiki/On-base_percentage)
* [Slugging Percentage](https://en.wikipedia.org/wiki/Slugging_percentage)

Click on the links provided and search the wikipedia page for the formula for creating the new metric!

For example, the Batting Average is equal to H (Hits) divided by AB (At Bats).

**Create a new column with the Batting Average metric. Name this Column 'BA'.**

In [6]:
# Batting Average is the number of Hits divided by a player's At Bats (~attempts)
batting['BA'] = batting['H']/batting['AB']

**After doing this operation, check the last 5 entries of the BA column of your data frame.**

In [7]:
# Using .tail() we retrieve the last five records
batting.BA.tail()

105856    0.226415
105857    0.000000
105858    0.263889
105859    0.305495
105860    0.201072
Name: BA, dtype: float64

Now do the same for some new columns: On Base Percentage (OBP) and Slugging Percentage (SLG). Hint: for SLG you need 1B (singles), which isn't in your data frame. However, you can calculate it by subtracting doubles, triples, and home runs from total hits (H): 1B = H - 2B - 3B - HR

* **Create an OBP Column**
* **Create an SLG Column**

In [8]:
# On Base Percentage = (Hits+BasesOnBalls+HitByPitch) / (At Bats+BasesOnBalls+HitByPitch+SacrificeFlies)
batting['OBP'] = (batting.H + batting.BB + batting.HBP) / (batting.AB + batting.BB + batting.HBP + batting.SF)

In [9]:
# Creating X1B (Singles)
batting['X1B'] = batting['H'] - (batting['HR'] + batting['X3B'] + batting['X2B'])

In [10]:
# Creating Slugging Average (SLG)
batting['SLG'] =  ((4*batting['HR']) + (3*batting['X3B']) + (2*batting['X2B']) + (1*batting['X1B'])) / batting['AB']
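
As a sanity check, the same arithmetic can be worked by hand for a single row. The sketch below uses Johnny Damon's 1995 line from this dataset (H=53, 2B=11, 3B=5, HR=3, AB=188); the variable names are only illustrative.

```python
# Johnny Damon's 1995 line as it appears in this dataset
hits, doubles, triples, home_runs, at_bats = 53, 11, 5, 3, 188

# Singles are the hits that were not doubles, triples, or home runs
singles = hits - doubles - triples - home_runs

# Each hit is weighted by the number of bases it is worth
total_bases = singles + 2 * doubles + 3 * triples + 4 * home_runs

batting_average = hits / at_bats
slugging = total_bases / at_bats

print(singles, total_bases)          # 34 83
print(round(batting_average, 6))     # 0.281915
print(round(slugging, 6))            # 0.441489
```

These values agree with the X1B, BA, and SLG columns for that row, which suggests the column formulas are wired up as intended.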

**Check the structure of your data frame using .info()**

In [11]:
batting.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 105861 entries, 0 to 105860
Data columns (total 26 columns):
playerID    105861 non-null object
yearID      105861 non-null int64
stint       105861 non-null int64
teamID      105861 non-null object
lgID        105123 non-null object
G           105861 non-null int64
AB          105861 non-null int64
R           105861 non-null int64
H           105861 non-null int64
X2B         105861 non-null int64
X3B         105861 non-null int64
HR          105861 non-null int64
RBI         105105 non-null float64
SB          103493 non-null float64
CS          82320 non-null float64
BB          105861 non-null int64
SO          103761 non-null float64
IBB         69210 non-null float64
HBP         103044 non-null float64
SH          99792 non-null float64
SF          69757 non-null float64
GIDP        80420 non-null float64
BA          89521 non-null float64
OBP         53929 non-null float64
X1B         105861 non-null int64
SLG         89521 non-null float64
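
Notice that BA and SLG have only 89521 non-null values out of 105861 rows, and OBP has even fewer. A likely explanation, sketched below on made-up rows, is that 0 divided by 0 yields NaN and that a missing HBP or SF value propagates into OBP.

```python
import pandas as pd

# Made-up rows: row 0 is a normal season, row 1 has no at bats,
# row 2 has a missing SF value (many batting rows have no SF recorded)
toy = pd.DataFrame({
    "H":   [40, 0, 30],
    "AB":  [120, 0, 100],
    "BB":  [10, 0, 5],
    "HBP": [2.0, 0.0, 1.0],
    "SF":  [3.0, 0.0, None],
})

# 0 / 0 produces NaN rather than raising an error
toy["BA"] = toy["H"] / toy["AB"]

# Any NaN in the inputs makes the whole OBP expression NaN
toy["OBP"] = (toy.H + toy.BB + toy.HBP) / (toy.AB + toy.BB + toy.HBP + toy.SF)

print(toy[["BA", "OBP"]].isna())
```

Whether to treat missing HBP or SF as zero (for example with `fillna(0)`) is a modelling choice; this notebook leaves the NaN values in place.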

## Merging Salary Data with Batting Data

We know we don't just want the best players, we want the most undervalued players, meaning we will also need to know current salary information! We have salary information in a csv file.

Complete the following steps to merge the salary data with the player stats!

**Load the Salaries.csv file into a dataframe called sal.**

In [12]:
sal_url = 'https://rotterdamai001.blob.core.windows.net/python/mlb/core/Salaries.csv'

In [13]:
sal = pd.read_csv(sal_url)

**Use describe to get a summary of the batting data frame and notice the minimum year in the yearID column. Our batting data goes back to 1871! Our salary data starts at 1985, meaning we need to remove the batting data that occurred before 1985.**

**Reassign batting to only contain data from 1985 and onwards**

In [14]:
# Get the summary statistics on the dataset
batting.describe()

Unnamed: 0,yearID,stint,G,AB,R,H,X2B,X3B,HR,RBI,...,SO,IBB,HBP,SH,SF,GIDP,BA,OBP,X1B,SLG
count,105861.0,105861.0,105861.0,105861.0,105861.0,105861.0,105861.0,105861.0,105861.0,105105.0,...,103761.0,69210.0,103044.0,99792.0,69757.0,80420.0,89521.0,53929.0,105861.0,89521.0
mean,1965.77529,1.078773,51.214338,140.960694,18.694212,36.861583,6.266321,1.271875,2.843209,16.968432,...,20.577057,1.086086,1.060809,2.252535,1.043594,2.9404,0.208322,0.25938,26.480177,0.290522
std,39.319486,0.286613,47.057599,184.433173,28.180404,52.47126,9.666868,2.621256,6.354058,26.342342,...,28.345666,2.75042,2.290011,4.198339,1.950372,4.7085,0.122852,0.145034,37.33465,0.188487
min,1871.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,1935.0,1.0,13.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.145833,0.186441,0.0,0.176471
50%,1975.0,1.0,34.0,48.0,4.0,9.0,1.0,0.0,0.0,3.0,...,9.0,0.0,0.0,0.0,0.0,0.0,0.230769,0.290566,7.0,0.309524
75%,2000.0,1.0,80.0,229.0,27.0,57.0,9.0,1.0,2.0,24.0,...,29.0,1.0,1.0,3.0,1.0,4.0,0.273973,0.337079,42.0,0.398773
max,2018.0,5.0,165.0,716.0,198.0,262.0,67.0,36.0,73.0,191.0,...,223.0,120.0,51.0,67.0,19.0,36.0,1.0,1.0,225.0,4.0


In [15]:
# Take a subset of the batting dataframe and save it, again, into batting
batting = batting[batting.yearID >= 1985]
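
The subsetting step relies on a boolean mask. Here is a minimal sketch on made-up rows; the optional `.copy()` is an addition that is not used in this notebook, and it can help avoid pandas' SettingWithCopyWarning when columns are added to the subset later.

```python
import pandas as pd

# Made-up rows spanning the 1985 cutoff
toy = pd.DataFrame({"playerID": ["a01", "b01", "c01"], "yearID": [1984, 1985, 2001]})

# Boolean mask: True for the rows we want to keep
mask = toy["yearID"] >= 1985

# Index with the mask; .copy() returns an independent frame
recent = toy[mask].copy()

print(recent["yearID"].min())
```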

**Now use .describe() again to make sure the subset reassignment worked, your yearID min should be 1985.**

In [16]:
# Get the summary statistics on the dataset
batting.describe()

Unnamed: 0,yearID,stint,G,AB,R,H,X2B,X3B,HR,RBI,...,SO,IBB,HBP,SH,SF,GIDP,BA,OBP,X1B,SLG
count,43606.0,43606.0,43606.0,43606.0,43606.0,43606.0,43606.0,43606.0,43606.0,43606.0,...,43606.0,43606.0,43606.0,43606.0,43606.0,43606.0,31883.0,31985.0,43606.0,31883.0
mean,2002.634821,1.083291,50.634958,122.732629,16.381759,32.016787,6.177567,0.689676,3.59556,15.528689,...,23.864973,0.949342,1.109801,1.172889,1.010044,2.751021,0.202672,0.258944,21.553983,0.301263
std,9.66086,0.290565,46.010253,180.753142,26.928199,50.462246,10.204954,1.609161,7.401939,26.429059,...,34.818562,2.672178,2.46207,2.397355,1.940023,4.673531,0.131209,0.151523,33.837804,0.208998
min,1985.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,1995.0,1.0,14.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.131579,0.181818,0.0,0.16
50%,2003.0,1.0,33.0,21.0,1.0,3.0,0.0,0.0,0.0,1.0,...,6.0,0.0,0.0,0.0,0.0,0.0,0.230769,0.293774,2.0,0.333333
75%,2011.0,1.0,75.0,186.0,22.0,46.0,9.0,1.0,3.0,20.0,...,35.0,1.0,1.0,1.0,1.0,4.0,0.272727,0.339161,31.0,0.422481
max,2018.0,5.0,163.0,716.0,152.0,262.0,59.0,23.0,73.0,165.0,...,223.0,120.0,35.0,39.0,17.0,35.0,1.0,1.0,225.0,4.0


Now it is time to merge the batting data with the salary data! Since we have players playing multiple years, we'll have repetitions of playerIDs for multiple years, meaning we want to merge on both players and years.

**Use the pd.merge() function to merge the batting and sal data frames by ['playerID','yearID','teamID']. Call the new data frame combo.**

In [17]:
# We merge based on three columns, as a player may have played for 2 teams in one year
combo = batting.merge(sal,on=['playerID','yearID','teamID'])
combo.head()

Unnamed: 0,playerID,yearID,stint,teamID,lgID_x,G,AB,R,H,X2B,...,HBP,SH,SF,GIDP,BA,OBP,X1B,SLG,lgID_y,salary
0,ackerji01,1985,1,TOR,AL,61,0,0,0,0,...,0.0,0.0,0.0,0.0,,,0,,AL,170000
1,agostju01,1985,1,CHA,AL,54,0,0,0,0,...,0.0,0.0,0.0,0.0,,,0,,AL,147500
2,aguaylu01,1985,1,PHI,NL,91,165,27,46,7,...,6.0,4.0,3.0,7.0,0.278788,0.377551,30,0.466667,NL,237000
3,alexado01,1985,1,TOR,AL,36,0,0,0,0,...,0.0,0.0,0.0,0.0,,,0,,AL,875000
4,allenne01,1985,1,SLN,NL,23,2,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0,0.0,NL,750000
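
Two things in this output are worth understanding: the row count drops after the merge, and lgID appears twice as lgID_x and lgID_y. The sketch below reproduces both effects on made-up data whose column names mirror the batting and salary frames.

```python
import pandas as pd

# Made-up data; column names mirror the batting and salary frames
bat = pd.DataFrame({
    "playerID": ["a01", "a01", "b01"],
    "yearID":   [2001, 2001, 2001],
    "teamID":   ["OAK", "NYA", "BOS"],
    "lgID":     ["AL", "AL", "AL"],
    "AB":       [300, 150, 500],
})
pay = pd.DataFrame({
    "playerID": ["a01", "b01"],
    "yearID":   [2001, 2001],
    "teamID":   ["OAK", "BOS"],
    "lgID":     ["AL", "AL"],
    "salary":   [250000, 900000],
})

# Default how="inner": only key combinations present in both frames survive,
# so the a01/NYA stint without a salary row is dropped
merged = bat.merge(pay, on=["playerID", "yearID", "teamID"])

# lgID exists in both frames but is not a merge key, so pandas adds _x and _y suffixes
print(merged.columns.tolist())
print(len(merged))
```

Adding lgID to the `on` list would be one way to avoid the duplicated column, assuming the league values agree in both files.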


**Retrieve the summary statistics of our new combo data.**

In [18]:
combo.describe()

Unnamed: 0,yearID,stint,G,AB,R,H,X2B,X3B,HR,RBI,...,IBB,HBP,SH,SF,GIDP,BA,OBP,X1B,SLG,salary
count,25441.0,25441.0,25441.0,25441.0,25441.0,25441.0,25441.0,25441.0,25441.0,25441.0,...,25441.0,25441.0,25441.0,25441.0,25441.0,19858.0,19915.0,25441.0,19858.0,25441.0
mean,2000.917535,1.007625,67.843717,173.077473,23.346567,45.643882,8.815888,0.971306,5.132896,22.280885,...,1.434102,1.554459,1.636846,1.4626,3.915019,0.211102,0.269012,30.723792,0.315935,2114363.0
std,8.917739,0.094362,47.616122,205.066987,31.100341,57.922592,11.766046,1.897025,8.70225,30.694146,...,3.295388,2.884414,2.843783,2.279098,5.408808,0.119831,0.140616,38.879314,0.191636,3476874.0
min,1985.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,1994.0,1.0,29.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.159091,0.206897,0.0,0.2,300000.0
50%,2001.0,1.0,56.0,63.0,5.0,11.0,2.0,0.0,0.0,5.0,...,0.0,0.0,0.0,0.0,1.0,0.241573,0.304734,9.0,0.351974,575000.0
75%,2009.0,1.0,107.0,331.0,41.0,85.0,16.0,1.0,7.0,38.0,...,1.0,2.0,2.0,2.0,7.0,0.275641,0.344444,57.0,0.431233,2400000.0
max,2016.0,4.0,163.0,716.0,152.0,262.0,59.0,23.0,73.0,165.0,...,120.0,35.0,39.0,17.0,35.0,1.0,1.0,225.0,4.0,33000000.0


## Analyzing lost players

The Oakland A's lost 3 key players during the off-season. We'll want to get their stats to see what we have to replace. The players lost were: first baseman and 2000 AL MVP Jason Giambi (giambja01) to the New York Yankees, outfielder Johnny Damon (damonjo01) to the Boston Red Sox, and infielder Olmedo Sáenz (saenzol01).

**Get a data frame called lost_players from the combo data frame consisting of those 3 players.**

In [19]:
lost_players_Ids = ['giambja01','damonjo01','saenzol01']
lost_players = combo[combo.playerID.isin(lost_players_Ids)]

**Review the first 40 rows of lost_players.**

In [20]:
# use .head() with a parameter for n of rows
lost_players.head(40)

Unnamed: 0,playerID,yearID,stint,teamID,lgID_x,G,AB,R,H,X2B,...,HBP,SH,SF,GIDP,BA,OBP,X1B,SLG,lgID_y,salary
7295,damonjo01,1995,1,KCA,AL,47,188,32,53,11,...,1.0,2.0,3.0,2.0,0.281915,0.323529,34,0.441489,AL,109000
7392,giambja01,1995,1,OAK,AL,54,176,27,45,7,...,3.0,1.0,2.0,4.0,0.255682,0.363636,32,0.397727,AL,109000
8246,damonjo01,1996,1,KCA,AL,145,517,61,140,22,...,3.0,10.0,5.0,4.0,0.270793,0.31295,107,0.367505,AL,180000
8330,giambja01,1996,1,OAK,AL,140,536,84,156,40,...,5.0,1.0,5.0,15.0,0.291045,0.355109,95,0.481343,AL,120000
9156,damonjo01,1997,1,KCA,AL,146,472,70,130,12,...,3.0,6.0,1.0,3.0,0.275424,0.337838,102,0.385593,AL,240000
9238,giambja01,1997,1,OAK,AL,142,519,66,152,41,...,6.0,0.0,8.0,11.0,0.292871,0.362245,89,0.495183,AL,205000
10055,damonjo01,1998,1,KCA,AL,161,642,104,178,30,...,4.0,3.0,3.0,4.0,0.277259,0.339463,120,0.439252,AL,460000
10146,giambja01,1998,1,OAK,AL,153,562,92,166,28,...,5.0,0.0,9.0,16.0,0.295374,0.383562,111,0.489324,AL,315000
11020,damonjo01,1999,1,KCA,AL,145,583,101,179,39,...,3.0,3.0,4.0,13.0,0.307033,0.378995,117,0.476844,AL,2100000
11111,giambja01,1999,1,OAK,AL,158,575,115,181,36,...,7.0,0.0,8.0,11.0,0.314783,0.421583,111,0.553043,AL,2103333


Since all these players were lost in the offseason after the 2001 season, let's only concern ourselves with the data from 2001.

**Save a subset of lost_players again to only grab the rows where the yearID was 2001.**

In [21]:
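# Display the 2001 rows. Note: the result is not assigned back to lost_players,
# so lost_players still holds every season in the cells below
lost_players[lost_players.yearID == 2001]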

Unnamed: 0,playerID,yearID,stint,teamID,lgID_x,G,AB,R,H,X2B,...,HBP,SH,SF,GIDP,BA,OBP,X1B,SLG,lgID_y,salary
12749,damonjo01,2001,1,OAK,AL,155,644,108,165,34,...,5.0,5.0,4.0,7.0,0.256211,0.323529,118,0.363354,AL,7100000
12830,giambja01,2001,1,OAK,AL,154,520,109,178,47,...,13.0,0.0,9.0,17.0,0.342308,0.4769,91,0.659615,AL,4103333
13237,saenzol01,2001,1,OAK,AL,106,305,33,67,21,...,13.0,1.0,3.0,9.0,0.219672,0.291176,36,0.383607,AL,290000
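
From these three rows we can work out the targets that the replacement players have to meet. This is a minimal sketch that copies AB and OBP from the table above; the names `lost_2001`, `target_ab`, and `target_obp` are illustrative and do not appear elsewhere in this notebook.

```python
# AB and OBP of the three lost players in 2001, copied from the table above
lost_2001 = {
    "damonjo01": {"AB": 644, "OBP": 0.323529},
    "giambja01": {"AB": 520, "OBP": 0.4769},
    "saenzol01": {"AB": 305, "OBP": 0.291176},
}

# Combined At Bats the replacements need to match or exceed
target_ab = sum(p["AB"] for p in lost_2001.values())

# Mean OBP the replacements need to match or exceed
target_obp = sum(p["OBP"] for p in lost_2001.values()) / len(lost_2001)

print(target_ab)               # 1469
print(round(target_obp, 6))    # 0.363868
```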


**Reduce the lost_players data frame to only include the following columns: playerID,H,X2B,X3B,HR,OBP,SLG,BA,AB**

In [22]:
# Using .loc[row_indexer,column_indexer] we reduce the DataFrame
include_columns = ['playerID','H','X2B','X3B','HR','OBP','SLG','BA','AB']
lost_players = lost_players.loc[:,include_columns]

**Print the .head() of lost_players dataframe**

In [23]:
lost_players.head()

Unnamed: 0,playerID,H,X2B,X3B,HR,OBP,SLG,BA,AB
7295,damonjo01,53,11,5,3,0.323529,0.441489,0.281915,188
7392,giambja01,45,7,0,6,0.363636,0.397727,0.255682,176
8246,damonjo01,140,22,5,6,0.31295,0.367505,0.270793,517
8330,giambja01,156,40,1,20,0.355109,0.481343,0.291045,536
9156,damonjo01,130,12,8,8,0.337838,0.385593,0.275424,472


## Replacement Players

Now we have all the information we need! Here is your final task - Find Replacement Players for the key three players we lost! However, you have three constraints:

* **The total combined salary of the three players cannot exceed 15 million dollars.**
* **Their combined number of At Bats (AB) needs to be equal to or greater than that of the lost players.**
* **Their mean OBP has to be equal to or greater than the mean OBP of the lost players.**

Use the combo dataframe you previously created as the source of information! Remember to just use the 2001 subset of that dataframe. It should be relatively simple to find 3 players that satisfy the requirements, and there are many correct combinations available, so be creative!

This is where you can really have fun and explore the data with a plotting library such as matplotlib to figure out which data points are good to split your data on when looking for replacement players. This ending is left intentionally open-ended so you can get a feel for exploring real data!

In [24]:
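# lost_players was not reduced to 2001 above, so these averages cover all of the players' seasons in combo
print("Average AB",lost_players.AB.mean())
print("Average OBP",lost_players.OBP.mean())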

Average AB 424.4186046511628
Average OBP 0.36510441063193805


In [25]:
# Reduce the dataset to include statistics data of season of the year 2001
combo = combo[combo.yearID==2001]
# Let us focus on the most important columns for this task
combo = combo.loc[:,['playerID','AB','OBP','salary']]
# We remove the players we've lost from the DataFrame, NOTE: the use of ~ as a negator
combo = combo[~combo.playerID.isin(lost_players_Ids)]
# Keep players with more At Bats than the lost players' average
combo = combo[combo.AB > lost_players.AB.mean()]
# Exclude players with a salary higher than 7 million
combo = combo[combo.salary <= 7000000]
# Keep players whose OBP is at least the lost players' average
combo = combo[combo.OBP >= lost_players.OBP.mean()]
# Create a new metric that combines both AB & OBP
combo['MAX'] = combo['AB'] * combo['OBP']
combo.sort_values(by='MAX', ascending=False)

Unnamed: 0,playerID,AB,OBP,salary,MAX
13297,suzukic01,692,0.381471,5666667,263.978202
12847,gonzalu01,609,0.428571,4833333,261.0
12894,heltoto01,587,0.431655,4950000,253.381295
12651,berkmla01,577,0.430233,305000,248.244186
12622,bagweje01,600,0.39749,6500000,238.493724
13181,pujolal01,590,0.402963,200000,237.748148
13288,stewash01,640,0.37106,2183333,237.47851
12619,aurilri01,636,0.368805,3250000,234.559767
13167,pierrju01,617,0.378176,215000,233.334828
12666,boonebr01,623,0.372263,3250000,231.919708
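
Instead of picking three players by eye, the constraints can also be checked for every possible trio. The sketch below copies the ten candidates from the table above and assumes targets derived from the lost players' 2001 rows shown earlier (644 + 520 + 305 At Bats, and OBPs of 0.323529, 0.4769, and 0.291176). The names `TARGET_AB`, `TARGET_OBP`, `BUDGET`, and `is_valid` are illustrative.

```python
from itertools import combinations

# (playerID, AB, OBP, salary) copied from the candidate table above
candidates = [
    ("suzukic01", 692, 0.381471, 5666667),
    ("gonzalu01", 609, 0.428571, 4833333),
    ("heltoto01", 587, 0.431655, 4950000),
    ("berkmla01", 577, 0.430233, 305000),
    ("bagweje01", 600, 0.39749, 6500000),
    ("pujolal01", 590, 0.402963, 200000),
    ("stewash01", 640, 0.37106, 2183333),
    ("aurilri01", 636, 0.368805, 3250000),
    ("pierrju01", 617, 0.378176, 215000),
    ("boonebr01", 623, 0.372263, 3250000),
]

# Assumed targets: combined AB and mean OBP of the three lost players in 2001
TARGET_AB = 1469
TARGET_OBP = 0.363868
BUDGET = 15_000_000


def is_valid(trio):
    """Return True if a trio of candidates meets all three constraints."""
    total_ab = sum(p[1] for p in trio)
    mean_obp = sum(p[2] for p in trio) / len(trio)
    total_salary = sum(p[3] for p in trio)
    return total_ab >= TARGET_AB and mean_obp >= TARGET_OBP and total_salary <= BUDGET


# Try every combination of three distinct candidates
valid = [trio for trio in combinations(candidates, 3) if is_valid(trio)]

# Among the valid trios, pick the one with the lowest combined salary
cheapest = min(valid, key=lambda trio: sum(p[3] for p in trio))

print(len(valid))
print(sorted(p[0] for p in cheapest))
```

With only ten candidates this brute-force search is cheap; for the full 2001 table you would probably want to filter first, as done in the cell above.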


In [26]:
# From this list I choose berkmla01, pujolal01, pierrju01. In this way we may throw a big party with the leftover budget.
selected_players = ['berkmla01','pujolal01','pierrju01']
combo = combo[combo.playerID.isin(selected_players)]
combo.describe()

Unnamed: 0,AB,OBP,salary,MAX
count,3.0,3.0,3.0,3.0
mean,594.666667,0.403791,240000.0,239.775721
std,20.404248,0.026038,56789.083458,7.65869
min,577.0,0.378176,200000.0,233.334828
25%,583.5,0.39057,207500.0,235.541488
50%,590.0,0.402963,215000.0,237.748148
75%,603.5,0.416598,260000.0,242.996167
max,617.0,0.430233,305000.0,248.244186


In [27]:
# About that party...
budget = 15000000
print('The remaining budget for a party is', budget - combo.salary.sum())

The remaining budget for a party is 14280000


## More exercises like these?

This project was inspired by Jose Portilla's Data Science and Machine Learning in R course on [Udemy](https://www.udemy.com/course/data-science-and-machine-learning-bootcamp-with-r/learn/lecture/5412744?start=180#overview). Jose has great content regarding Data Science using R as well as Python. If you liked this assignment and want to be challenged more, please find one of his courses through the Udemy platform.

![ImageOfUdemy](https://miro.medium.com/max/1200/1*HuQyl7_WMMzOfs8RIlQ-XA.png)