# S06 T01: Sampling methods task

**Description**:

Learn how to sample data with Python.

**Level 1**

- Exercise 1:

Pick a sports-themed dataset that you like. Sample the data by generating a simple random sample and a systematic sample.

# NBA Players

## Biometric, biographic and basic box score features from 1996 to 2019 season

Dataset file: [NBA Players](https://www.kaggle.com/justinas/nba-players-data?select=all_seasons.csv)

**Description**:

**Update 02-08-2021**: The data now includes 2020 season and metrics for 2019 have been updated.

**Update 08-03-2020**: The data now includes 2017, 2018 and 2019 seasons. Keep in mind that metrics like gp, pts, reb, etc. are not complete for 2019 season, as it is ongoing at the time of upload.

Context :
As a life-long fan of basketball I always wanted to combine my enthusiasm for the sport with passion for analytics. So, I utilized the NBA Stats API to pull together this data set. I hope it will prove to be as interesting to work with for you as it has been for me!

Content :
The data set contains over two decades of data on each player who has been part of an NBA team's roster. It captures demographic variables such as age, height, weight and place of birth, and biographical details like the team played for, draft year and round. In addition, it has basic box score statistics such as games played, average number of points, rebounds, assists, etc.

The pull initially contained 52 rows of missing data. The gaps have been manually filled using data from Basketball Reference. I am not aware of any other data quality issues.

Analysis Ideas :
The data set can be used to explore how age/height/weight tendencies have changed over time due to changes in game philosophy and player development strategies. Also, it could be interesting to see how geographically diverse the NBA is and how overseas talents have influenced it. A longitudinal study on players' career arcs can also be performed.

`player_name` - player name

`team_abbreviation` - abbreviated team name

`age` - age of the player

The remaining biographic variables are largely self-explanatory. The box score variables are:

`gp` - games played

`pts` - average pts per game

`reb` - rebounds per game

`ast` - assists per game

`net_rating` - Team’s point differential per 100 possessions while the player is on the court

`oreb_pct` - Percentage of available offensive rebounds the player grabbed while he was on the floor

`dreb_pct` - Percentage of available defensive rebounds the player grabbed while he was on the floor

`usg_pct` - Percentage of team plays used by the player while he was on the floor

`ts_pct` - Measure of the player’s shooting efficiency that takes into account free throws, 2 and 3 point shots

`ast_pct` - Percentage of teammate field goals the player assisted while he was on the floor

`season` - NBA season

In [1]:
import matplotlib.pyplot as plt
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

In [2]:
# Input data files are available in the "./input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory
import os
for dirname, _, filenames in os.walk('./input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
# Any results you write to the current directory are saved as output.

- Configure the pandas display options (float format, column, width and row limits) with `set_option`

In [3]:
pd.set_option('display.float_format', lambda x: '%.3f' % x)
pd.set_option('display.max_columns', 1000, 'display.width', 1000, 'display.max_rows',1000)
#pd.set_option('display.max_colwidth', None)

1. Read the data as a pandas dataframe and display the first 5 rows

In [4]:
all_seasons_df = pd.read_csv("./input/all_seasons.csv").drop('Unnamed: 0',axis = 1) #, index_col = 0
all_seasons_df.head()

Unnamed: 0,player_name,team_abbreviation,age,player_height,player_weight,college,country,draft_year,draft_round,draft_number,gp,pts,reb,ast,net_rating,oreb_pct,dreb_pct,usg_pct,ts_pct,ast_pct,season
0,Travis Knight,LAL,22.0,213.36,106.594,Connecticut,USA,1996,1,29,71,4.8,4.5,0.5,6.2,0.127,0.182,0.142,0.536,0.052,1996-97
1,Matt Fish,MIA,27.0,210.82,106.594,North Carolina-Wilmington,USA,1992,2,50,6,0.3,0.8,0.0,-15.1,0.143,0.267,0.265,0.333,0.0,1996-97
2,Matt Bullard,HOU,30.0,208.28,106.594,Iowa,USA,Undrafted,Undrafted,Undrafted,71,4.5,1.6,0.9,0.9,0.016,0.115,0.151,0.535,0.099,1996-97
3,Marty Conlon,BOS,29.0,210.82,111.13,Providence,USA,Undrafted,Undrafted,Undrafted,74,7.8,4.4,1.4,-9.0,0.083,0.152,0.167,0.542,0.101,1996-97
4,Martin Muursepp,DAL,22.0,205.74,106.594,,USA,1996,1,25,42,3.7,1.6,0.5,-14.5,0.109,0.118,0.233,0.482,0.114,1996-97


- Number of records for each NBA season since 1996

In [5]:
#all_seasons_df.groupby(['season']).count()
all_seasons_df['season'].value_counts().sort_index(ascending=False)#.plot.barh()

2020-21    540
2019-20    529
2018-19    530
2017-18    540
2016-17    486
2015-16    476
2014-15    492
2013-14    482
2012-13    469
2011-12    478
2010-11    452
2009-10    442
2008-09    445
2007-08    451
2006-07    458
2005-06    458
2004-05    464
2003-04    442
2002-03    428
2001-02    440
2000-01    441
1999-00    438
1998-99    439
1997-98    439
1996-97    441
Name: season, dtype: int64

- Number of seasons played by each player

In [6]:
all_seasons_df['player_name'].value_counts()

Vince Carter      22
Dirk Nowitzki     21
Kobe Bryant       20
Kevin Garnett     20
Jamal Crawford    20
                  ..
Gian Clavell       1
Tito Maddox        1
DeQuan Jones       1
Adonis Jordan      1
Mason Jones        1
Name: player_name, Length: 2333, dtype: int64

- The number of seasons played by Kobe Bryant is confirmed with a simple check

In [7]:
# filtering with query method
df = all_seasons_df.query('player_name == "Kobe Bryant"')#, inplace = True
df.shape

(20, 21)

- We group the rows by player and season and count the entries in each group

In [8]:

grouped = all_seasons_df.groupby(['player_name', 'season']).count()
grouped.head(10)

Unnamed: 0_level_0,Unnamed: 1_level_0,team_abbreviation,age,player_height,player_weight,college,country,draft_year,draft_round,draft_number,gp,pts,reb,ast,net_rating,oreb_pct,dreb_pct,usg_pct,ts_pct,ast_pct
player_name,season,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1
A.C. Green,1996-97,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
A.C. Green,1997-98,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
A.C. Green,1998-99,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
A.C. Green,1999-00,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
A.C. Green,2000-01,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
A.J. Bramlett,1999-00,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
A.J. Guyton,2000-01,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
A.J. Guyton,2001-02,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
A.J. Guyton,2002-03,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
AJ Hammons,2016-17,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1


2. Display the number of rows and columns in the dataset.

In [9]:
all_seasons_df.shape

(11700, 21)

In [10]:
# Drop any row that contains null values; an unchanged shape means there were none
all_seasons_df.dropna(inplace=True)
all_seasons_df.shape

(11700, 21)

In [11]:
# Checking for duplicate values; False => NO duplicated values
all_seasons_df.duplicated().values.any()

False
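
The two checks above can also be expressed as counts, which makes it easier to see how many rows would be affected before dropping anything. The snippet below is a minimal sketch on a hypothetical toy frame (`toy`, with made-up values), not on the NBA table itself.

```python
import pandas as pd

# Hypothetical toy frame: one missing college and one fully repeated row
toy = pd.DataFrame({
    "college": ["Iowa", None, "Kansas", "Iowa"],
    "pts": [4.5, 3.7, 13.4, 4.5],
})

null_counts = toy.isna().sum()         # missing values per column
n_duplicates = toy.duplicated().sum()  # rows that repeat an earlier row

print(null_counts)
print(n_duplicates)
```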

In [12]:
all_seasons_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 11700 entries, 0 to 11699
Data columns (total 21 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   player_name        11700 non-null  object 
 1   team_abbreviation  11700 non-null  object 
 2   age                11700 non-null  float64
 3   player_height      11700 non-null  float64
 4   player_weight      11700 non-null  float64
 5   college            11700 non-null  object 
 6   country            11700 non-null  object 
 7   draft_year         11700 non-null  object 
 8   draft_round        11700 non-null  object 
 9   draft_number       11700 non-null  object 
 10  gp                 11700 non-null  int64  
 11  pts                11700 non-null  float64
 12  reb                11700 non-null  float64
 13  ast                11700 non-null  float64
 14  net_rating         11700 non-null  float64
 15  oreb_pct           11700 non-null  float64
 16  dreb_pct           117

In [13]:
all_seasons_df.describe()

Unnamed: 0,age,player_height,player_weight,gp,pts,reb,ast,net_rating,oreb_pct,dreb_pct,usg_pct,ts_pct,ast_pct
count,11700.0,11700.0,11700.0,11700.0,11700.0,11700.0,11700.0,11700.0,11700.0,11700.0,11700.0,11700.0,11700.0
mean,27.132,200.729,100.527,51.717,8.169,3.565,1.811,-2.166,0.055,0.142,0.185,0.51,0.131
std,4.34,9.17,12.526,24.985,5.956,2.487,1.792,12.077,0.044,0.063,0.053,0.098,0.094
min,18.0,160.02,60.328,1.0,0.0,0.0,0.0,-200.0,0.0,0.0,0.0,0.0,0.0
25%,24.0,193.04,90.718,32.0,3.6,1.8,0.6,-6.3,0.021,0.096,0.15,0.479,0.065
50%,26.0,200.66,99.79,58.0,6.7,3.0,1.2,-1.3,0.042,0.132,0.182,0.523,0.103
75%,30.0,208.28,108.862,74.0,11.5,4.7,2.4,3.2,0.084,0.18,0.218,0.559,0.178
max,44.0,231.14,163.293,85.0,36.1,16.3,11.7,300.0,1.0,1.0,1.0,1.5,1.0


- We set the index of the data frame to player name and season.

In [14]:
season_wise_df = all_seasons_df.set_index(['player_name','season'])

# Keep the undrafted rows aside, then replace the 'Undrafted' marker with NaN
# (np.nan is used because the np.NaN alias was removed in NumPy 2.0)
Undrafted = season_wise_df[season_wise_df['draft_year'] == 'Undrafted']
season_wise_df['draft_year'] = season_wise_df['draft_year'].replace('Undrafted', np.nan)
season_wise_df['draft_round'] = season_wise_df['draft_round'].replace('Undrafted', np.nan)
season_wise_df['draft_number'] = season_wise_df['draft_number'].replace('Undrafted', np.nan)
season_wise_df

Unnamed: 0_level_0,Unnamed: 1_level_0,team_abbreviation,age,player_height,player_weight,college,country,draft_year,draft_round,draft_number,gp,pts,reb,ast,net_rating,oreb_pct,dreb_pct,usg_pct,ts_pct,ast_pct
player_name,season,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1
Travis Knight,1996-97,LAL,22.000,213.360,106.594,Connecticut,USA,1996,1,29,71,4.800,4.500,0.500,6.200,0.127,0.182,0.142,0.536,0.052
Matt Fish,1996-97,MIA,27.000,210.820,106.594,North Carolina-Wilmington,USA,1992,2,50,6,0.300,0.800,0.000,-15.100,0.143,0.267,0.265,0.333,0.000
Matt Bullard,1996-97,HOU,30.000,208.280,106.594,Iowa,USA,,,,71,4.500,1.600,0.900,0.900,0.016,0.115,0.151,0.535,0.099
Marty Conlon,1996-97,BOS,29.000,210.820,111.130,Providence,USA,,,,74,7.800,4.400,1.400,-9.000,0.083,0.152,0.167,0.542,0.101
Martin Muursepp,1996-97,DAL,22.000,205.740,106.594,,USA,1996,1,25,42,3.700,1.600,0.500,-14.500,0.109,0.118,0.233,0.482,0.114
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Matthew Dellavedova,2020-21,CLE,30.000,190.500,90.718,St.Mary's College of California,Australia,,,,13,2.800,1.800,4.500,-3.100,0.029,0.085,0.125,0.312,0.337
Maurice Harkless,2020-21,SAC,28.000,200.660,99.790,St. John's,USA,2012,1,15,37,5.200,2.400,1.200,-2.900,0.017,0.097,0.114,0.527,0.071
Max Strus,2020-21,MIA,25.000,195.580,97.522,DePaul,USA,,,,39,6.100,1.100,0.600,-4.200,0.011,0.073,0.179,0.597,0.074
Marcus Morris Sr.,2020-21,LAC,31.000,203.200,98.883,Kansas,USA,2011,1,14,57,13.400,4.100,1.000,4.200,0.025,0.133,0.194,0.614,0.056
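
Replacing the marker leaves the draft columns with the `object` dtype. If numeric draft columns are needed later, one possible approach (an assumption here, not something the notebook does) is `pd.to_numeric` with `errors="coerce"`, which turns anything non-numeric into NaN in a single step. The series below is a hypothetical stand-in for `draft_year`.

```python
import pandas as pd

# Hypothetical stand-in for the draft_year column
draft_year = pd.Series(["1996", "Undrafted", "2012"])

# Non-numeric entries such as 'Undrafted' become NaN; the result is float64
draft_year_num = pd.to_numeric(draft_year, errors="coerce")

print(draft_year_num)
```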


In [15]:
# selecting columns required for analysis
cols_analysis = ['age','player_height','player_weight','gp','pts','reb','ast','net_rating','oreb_pct','dreb_pct','usg_pct','ts_pct','ast_pct']
analysis_sampling_df = season_wise_df[cols_analysis]
analysis_sampling_df

Unnamed: 0_level_0,Unnamed: 1_level_0,age,player_height,player_weight,gp,pts,reb,ast,net_rating,oreb_pct,dreb_pct,usg_pct,ts_pct,ast_pct
player_name,season,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
Travis Knight,1996-97,22.000,213.360,106.594,71,4.800,4.500,0.500,6.200,0.127,0.182,0.142,0.536,0.052
Matt Fish,1996-97,27.000,210.820,106.594,6,0.300,0.800,0.000,-15.100,0.143,0.267,0.265,0.333,0.000
Matt Bullard,1996-97,30.000,208.280,106.594,71,4.500,1.600,0.900,0.900,0.016,0.115,0.151,0.535,0.099
Marty Conlon,1996-97,29.000,210.820,111.130,74,7.800,4.400,1.400,-9.000,0.083,0.152,0.167,0.542,0.101
Martin Muursepp,1996-97,22.000,205.740,106.594,42,3.700,1.600,0.500,-14.500,0.109,0.118,0.233,0.482,0.114
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Matthew Dellavedova,2020-21,30.000,190.500,90.718,13,2.800,1.800,4.500,-3.100,0.029,0.085,0.125,0.312,0.337
Maurice Harkless,2020-21,28.000,200.660,99.790,37,5.200,2.400,1.200,-2.900,0.017,0.097,0.114,0.527,0.071
Max Strus,2020-21,25.000,195.580,97.522,39,6.100,1.100,0.600,-4.200,0.011,0.073,0.179,0.597,0.074
Marcus Morris Sr.,2020-21,31.000,203.200,98.883,57,13.400,4.100,1.000,4.200,0.025,0.133,0.194,0.614,0.056


- A simple random sample with n observations is generated.

In [16]:
simple_all_seasons_df = analysis_sampling_df.sample(n = 500).sort_index(ascending = True)
simple_all_seasons_df

Unnamed: 0_level_0,Unnamed: 1_level_0,age,player_height,player_weight,gp,pts,reb,ast,net_rating,oreb_pct,dreb_pct,usg_pct,ts_pct,ast_pct
player_name,season,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
AJ Price,2012-13,26.0,187.96,83.915,57,7.7,2.0,3.6,-2.3,0.017,0.083,0.181,0.501,0.273
Aaron Brooks,2014-15,30.0,182.88,73.028,82,11.6,2.0,3.2,5.2,0.019,0.078,0.252,0.534,0.245
Aaron Harrison,2016-17,22.0,198.12,95.254,5,0.2,0.6,0.6,-18.6,0.0,0.2,0.142,0.102,0.375
Aaron Holiday,2019-20,23.0,182.88,83.915,66,9.5,2.4,3.4,2.2,0.013,0.077,0.182,0.521,0.188
Adam Keefe,2000-01,31.0,205.74,104.326,67,2.5,3.1,0.5,-9.2,0.105,0.175,0.116,0.45,0.066
Adam Mokoka,2020-21,22.0,193.04,86.182,14,1.1,0.4,0.4,-7.1,0.017,0.077,0.171,0.386,0.179
Al Jefferson,2011-12,27.0,208.28,131.088,61,19.2,9.6,2.2,3.2,0.074,0.248,0.261,0.52,0.113
Al Thornton,2009-10,26.0,203.2,99.79,75,10.7,3.9,1.2,-7.5,0.065,0.105,0.192,0.523,0.072
Alex Len,2020-21,28.0,213.36,113.398,64,6.6,4.1,0.8,-4.0,0.077,0.174,0.161,0.643,0.069
Alexey Shved,2012-13,24.0,198.12,82.554,77,8.6,2.3,3.7,-2.3,0.024,0.084,0.206,0.474,0.248


In [17]:
simple_all_seasons_df.describe()

Unnamed: 0,age,player_height,player_weight,gp,pts,reb,ast,net_rating,oreb_pct,dreb_pct,usg_pct,ts_pct,ast_pct
count,500.0,500.0,500.0,500.0,500.0,500.0,500.0,500.0,500.0,500.0,500.0,500.0,500.0
mean,27.228,200.98,100.57,52.13,8.72,3.74,1.916,-1.734,0.057,0.146,0.188,0.513,0.133
std,4.199,9.182,12.798,26.964,6.357,2.561,1.974,12.169,0.049,0.071,0.054,0.105,0.098
min,19.0,165.1,60.328,1.0,0.0,0.0,0.0,-100.2,0.0,0.0,0.0,0.0,0.0
25%,24.0,194.945,90.718,28.75,3.575,1.9,0.6,-6.225,0.02,0.096,0.152,0.483,0.064
50%,27.0,203.2,99.79,61.0,7.2,3.1,1.2,-0.9,0.045,0.13,0.182,0.528,0.102
75%,30.0,208.28,109.883,76.0,11.925,4.925,2.5,4.2,0.083,0.188,0.219,0.566,0.183
max,42.0,228.6,142.881,83.0,32.0,13.9,11.6,79.0,0.4,0.6,0.408,1.064,0.543
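
Because `sample` draws different rows on every run, the statistics above change each time the cell is executed. The sketch below shows how passing `random_state` makes a simple random sample repeatable and how its means can be set beside those of the full table. It uses a hypothetical toy frame (`toy_df`) filled with random numbers, not the NBA data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real table (made-up values, not NBA data)
rng = np.random.default_rng(0)
toy_df = pd.DataFrame({
    "pts": rng.gamma(shape=2.0, scale=4.0, size=1000),
    "gp": rng.integers(1, 83, size=1000),
})

# Simple random sample without replacement; random_state makes it repeatable
sample_a = toy_df.sample(n=100, random_state=42)
sample_b = toy_df.sample(n=100, random_state=42)

# The same seed selects the same rows
print(sample_a.index.equals(sample_b.index))

# Put the sample means next to the means of the full table
comparison = pd.DataFrame({
    "population": toy_df.mean(),
    "sample": sample_a.mean(),
})
print(comparison)
```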


- To generate a systematic sample, we first draw two random integers: `init`, the position of the first row to pick, and `step`, the interval between picked rows.

In [20]:
init = np.random.randint(1, 50)
step = np.random.randint(1, 50)

- Compute the number of rows in the data frame

In [18]:
end  = len(analysis_sampling_df) 

In [21]:
print('init:',init,'end:',end,'step:',step)

init: 13 end: 11700 step: 5


- Generate a systematic random sample:

In [22]:
systematic_all_seasons_df = analysis_sampling_df.iloc[init : end : step]
systematic_all_seasons_df

Unnamed: 0_level_0,Unnamed: 1_level_0,age,player_height,player_weight,gp,pts,reb,ast,net_rating,oreb_pct,dreb_pct,usg_pct,ts_pct,ast_pct
player_name,season,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
Mark Bradtke,1996-97,28.000,208.280,120.202,36,1.600,1.900,0.200,0.900,0.107,0.176,0.118,0.463,0.043
Malik Rose,1996-97,22.000,200.660,113.398,54,3.000,3.000,0.600,1.300,0.169,0.219,0.161,0.515,0.089
Lou Roe,1996-97,24.000,200.660,99.790,17,2.400,0.800,0.400,-14.400,0.054,0.069,0.249,0.355,0.103
Pervis Ellison,1996-97,30.000,205.740,95.254,6,2.500,4.300,0.700,1.200,0.088,0.167,0.090,0.412,0.040
Oliver Miller,1996-97,27.000,205.740,140.614,61,4.800,5.000,1.400,-1.000,0.107,0.204,0.141,0.539,0.116
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Malik Beasley,2020-21,24.000,193.040,84.822,37,19.600,4.400,2.400,-7.200,0.021,0.112,0.235,0.570,0.114
Marc Gasol,2020-21,36.000,210.820,115.666,52,5.000,4.100,2.100,4.300,0.039,0.174,0.117,0.606,0.145
Marques Bolden,2020-21,23.000,208.280,112.944,6,1.200,1.000,0.000,12.300,0.083,0.088,0.136,0.537,0.000
Matisse Thybulle,2020-21,24.000,195.580,91.172,65,3.900,1.900,1.000,4.600,0.023,0.070,0.090,0.508,0.064


In [23]:
systematic_all_seasons_df.describe()

Unnamed: 0,age,player_height,player_weight,gp,pts,reb,ast,net_rating,oreb_pct,dreb_pct,usg_pct,ts_pct,ast_pct
count,2338.0,2338.0,2338.0,2338.0,2338.0,2338.0,2338.0,2338.0,2338.0,2338.0,2338.0,2338.0,2338.0
mean,27.004,200.745,100.492,51.465,8.057,3.528,1.796,-2.397,0.055,0.142,0.186,0.508,0.131
std,4.296,9.3,12.463,25.115,5.918,2.465,1.79,11.83,0.043,0.061,0.056,0.097,0.095
min,19.0,165.1,60.328,1.0,0.0,0.0,0.0,-200.0,0.0,0.0,0.0,0.0,0.0
25%,24.0,193.04,90.718,31.0,3.5,1.7,0.6,-6.3,0.021,0.095,0.15,0.477,0.066
50%,26.0,200.66,99.79,58.0,6.6,2.9,1.2,-1.4,0.043,0.133,0.182,0.52,0.103
75%,30.0,208.28,108.862,74.0,11.3,4.7,2.4,3.0,0.085,0.184,0.218,0.558,0.178
max,42.0,231.14,163.293,84.0,35.4,16.0,11.6,88.5,0.5,0.571,1.0,1.042,0.75
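
The cells above draw `init` and `step` independently, so the sample size depends on whichever step happens to come out. A common textbook variant fixes the desired sample size `n` first, derives the interval as `N // n`, and draws the start inside the first interval. The function below is a minimal sketch of that variant; `systematic_sample` and `toy_df` are hypothetical names introduced for illustration.

```python
import numpy as np
import pandas as pd

def systematic_sample(df, n, seed=None):
    """Return roughly n rows taken at a fixed interval from a random start."""
    step = max(len(df) // n, 1)         # sampling interval k = N // n
    rng = np.random.default_rng(seed)
    start = int(rng.integers(0, step))  # random start in [0, step)
    return df.iloc[start::step]

# Toy frame whose values equal their row positions
toy_df = pd.DataFrame({"value": np.arange(1000)})
sample = systematic_sample(toy_df, n=50, seed=1)

print(len(sample))
print(sample.head())
```

Note that a systematic sample is only as good as the row order: if the table is sorted in a way that repeats with the same period as the step, the sample can be biased.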


3. Display 'Manu Ginobili' point averages for each of his seasons in the dataset.

Sources:
   - [loc-iloc (GeeksforGeeks)](https://www.geeksforgeeks.org/select-rows-columns-by-name-or-index-in-pandas-dataframe-using-loc-iloc/)  
   - [loc-iloc (Shane Lynn)](https://www.shanelynn.ie/pandas-iloc-loc-select-rows-and-columns-dataframe/)  
   - [Multiindex](https://stackoverflow.com/questions/53927460/select-rows-in-pandas-multiindex-dataframe)

In [32]:
#analysis_sampling_df.loc[analysis_sampling_df.player_name == 'Manu Ginobili', ['player_name', 'pts', 'season']]
#analysis_sampling_df.loc[['Manu Ginobili'],['pts']]
#analysis_sampling_df.loc[['Manu Ginobili', 'LeBron James'],['pts']]
analysis_sampling_df.loc['Manu Ginobili',['pts']].head(10)
#analysis_sampling_df.loc[[('Manu Ginobili','2002-03')],['pts','pts','reb','ast']]

Unnamed: 0_level_0,pts
season,Unnamed: 1_level_1
2002-03,7.6
2003-04,12.8
2004-05,16.0
2005-06,15.1
2006-07,16.5
2007-08,19.5
2008-09,15.5
2009-10,16.5
2010-11,17.4
2011-12,12.9
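
Selecting from a frame indexed by player and season can be done on the outer level, on the full key, or on the inner level alone. The snippet below sketches the three cases on a small hypothetical frame (`toy`, with invented players and values) so the shapes of the results are easy to compare.

```python
import pandas as pd

# Hypothetical two-level frame mirroring the (player_name, season) index
toy = pd.DataFrame(
    {"pts": [7.6, 12.8, 20.9, 27.2], "ast": [2.0, 3.8, 5.9, 7.2]},
    index=pd.MultiIndex.from_tuples(
        [("Player A", "2002-03"), ("Player A", "2003-04"),
         ("Player B", "2003-04"), ("Player B", "2004-05")],
        names=["player_name", "season"],
    ),
)

# Outer level only: that player's rows, indexed by season
by_player = toy.loc["Player A", ["pts"]]

# Full key: a single (player, season) value
one_value = toy.loc[("Player B", "2004-05"), "pts"]

# Inner level only: every player in a given season
by_season = toy.xs("2003-04", level="season")

print(by_player)
print(one_value)
print(by_season)
```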


4. List all the rows for players from Argentina.

In [None]:
all_seasons_df.query("country == 'Argentina'")

5. Calculate the average assists per season for 'Facundo Campazzo'

In [None]:
all_seasons_df.loc[all_seasons_df.player_name == 'Facundo Campazzo'].ast.mean()

6. Group the dataset by player then season in ascending order.

In [None]:
grouped = all_seasons_df.groupby(['player_name', 'season'])
grouped.first()

7. For each player, determine the number of seasons they have played since 1996.

In [None]:
seasons_played = all_seasons_df.player_name.value_counts()
seasons_played.head()

8. Plot 'Manu Ginobili' average points per game for each season.

In [None]:
import matplotlib.pyplot as plt

ginobili_statistics = all_seasons_df.loc[all_seasons_df.player_name == 'Manu Ginobili', ['pts','season']]
plt.plot(ginobili_statistics["season"], ginobili_statistics["pts"])
plt.ylabel('Points Per Game')
plt.xticks(rotation=90)
plt.title("Manu Ginobili Average Points Per Season")
plt.xlabel('Season')
plt.show()

9. Sort the dataset descending by rebounds per game and display the top 10 single season performances.

In [None]:
# Sort the whole table by rebounds per game so the player and season stay visible
all_seasons_df.sort_values('reb', ascending=False).head(10)

10. What is the highest points per game that Kobe Bryant achieved in a single season?

In [None]:
all_seasons2 = all_seasons_df.loc[all_seasons_df.player_name == 'Kobe Bryant']
all_seasons2['pts'].max()

**Level 2**

- Exercise 2:

Continue with the sports-themed dataset and generate a stratified sample and a sample using SMOTE (Synthetic Minority Oversampling Technique).


### Stratified sampling

In [None]:
all_seasons_df.columns

- The dataset contains the birthplace of each player who has made an NBA team's roster. It could be interesting to see the geographic diversity of the NBA and the influence of foreign talent.  

In [None]:
all_seasons_df['country'].unique()

In [None]:
all_seasons_df['country'].value_counts()

- First, we remove the countries that appear in only one row, because every class must contain at least 2 members for a stratified split. Otherwise `train_test_split` raises:
> ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.

In [None]:
main_influential_countries_df = all_seasons_df.groupby("country").filter(lambda g: g.country.size >= 2)

In [None]:

main_influential_countries_df['country'].value_counts().head(10)

In [None]:
# An alternative way to do the same:
cnt = all_seasons_df.country.value_counts()
v = cnt[cnt >= 2].index.values
out_df = all_seasons_df.query("country in @v")
out_df['country'].value_counts().head(10)

In [None]:
from sklearn.model_selection import train_test_split
Xtrain, Xtest = train_test_split(main_influential_countries_df, test_size = 0.20, stratify = main_influential_countries_df[['country']])
Xtrain

In [None]:
Xtrain['country'].unique()

In [None]:
Xtrain['country'].describe()
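
`train_test_split` is one way to obtain a stratified sample. Another option in pandas (version 1.1 or later) is to sample the same fraction inside every group with `groupby(...).sample(...)`. The sketch below uses a hypothetical toy frame (`toy_df`) with invented country counts to show that the group proportions are preserved.

```python
import pandas as pd

# Hypothetical toy frame with an 80/15/5 split between three countries
toy_df = pd.DataFrame({
    "country": ["USA"] * 80 + ["Argentina"] * 15 + ["Spain"] * 5,
    "pts": range(100),
})

# Draw 20% of the rows inside every country group
stratified = toy_df.groupby("country").sample(frac=0.2, random_state=0)

# The proportions in the sample match those of the full frame
print(toy_df["country"].value_counts(normalize=True))
print(stratified["country"].value_counts(normalize=True))
```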

### Sampling using the synthetic minority oversampling technique (SMOTE).

In [None]:
import imblearn
from imblearn.over_sampling import SMOTE

In [None]:
smote_all_seasons_df = all_seasons_df[(all_seasons_df['country'] == 'USA') | (all_seasons_df['country'] == 'Argentina')]
smote_all_seasons_df

In [None]:
smote_all_seasons_df['country'].unique()

In [None]:
smote_all_seasons_df.columns

In [None]:
cat_cols = ['player_name', 'team_abbreviation', 'college', 'draft_year', 'draft_round', 'draft_number', 'season']
num_cols = ['age', 'player_height', 'player_weight', 'gp', 'pts', 'reb', 'ast', 'net_rating', 'oreb_pct', 'dreb_pct',
    'usg_pct', 'ts_pct', 'ast_pct']

In [None]:
smote_all_seasons_df[num_cols].head()

In [None]:
smote_all_seasons_df[cat_cols].head()

In [None]:
smote_sampling_all_seasons_df = smote_all_seasons_df.drop(columns = cat_cols)


In [None]:
smote_sampling_all_seasons_df.head()

In [None]:
smote_sampling_all_seasons_df['country'] = [0 if x == 'USA' else 1 for x in smote_sampling_all_seasons_df['country']]

In [None]:
smote_sampling_all_seasons_df.head()

In [None]:
smote_sampling_all_seasons_df['country'].value_counts()

In [None]:
smote_sampling_all_seasons_df['country'].describe().apply("{0:.3f}".format)

In [None]:
smote_sampling_all_seasons_df.describe()

In [None]:
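smote = SMOTE(sampling_strategy = 'minority')
# Keep the target out of the feature matrix so it is not treated as a feature
X = smote_sampling_all_seasons_df.drop(columns = 'country')
y = smote_sampling_all_seasons_df['country']
X_sm, y_sm = smote.fit_resample(X, y)

In [None]:
X_sm

In [None]:
# After resampling, both classes should have the same number of rows
y_sm.value_counts()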
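
To make the idea behind SMOTE more concrete: each synthetic row is placed at a random point on the segment between a minority row and one of its nearest minority neighbours. The function below is a simplified NumPy sketch of that interpolation step only. It is not the `imblearn` implementation, and `smote_like` and `minority_rows` are hypothetical names with made-up values.

```python
import numpy as np

def smote_like(minority, n_new, k=3, seed=None):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)

    # Pairwise Euclidean distances between minority rows
    diffs = minority[:, None, :] - minority[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)                # a row is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]  # indices of the k nearest rows

    base = rng.integers(0, len(minority), size=n_new)           # starting rows
    picked = neighbours[base, rng.integers(0, k, size=n_new)]   # one neighbour each
    lam = rng.random((n_new, 1))                                # position on the segment
    return minority[base] + lam * (minority[picked] - minority[base])

# Hypothetical (height, weight) rows for the minority class
minority_rows = np.array([[198.0, 93.0], [190.0, 88.0], [206.0, 111.0],
                          [201.0, 100.0], [180.0, 75.0]])
synthetic = smote_like(minority_rows, n_new=10, k=3, seed=0)

print(synthetic.shape)
```

Because every synthetic row lies between two existing minority rows, the new points never leave the range spanned by the minority class.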

**Level 3**

- Exercise 3:

Continue with the sports-themed dataset and generate a sample using the reservoir sampling method.

In [None]:
all_seasons_df.head()

[Reservoir Sampling](https://mlwhiz.com/blog/2019/07/30/sampling/?utm_campaign=News&utm_medium=Community&utm_source=DataCamp.com)

In [None]:
import random

# Treat the games-played column as a stream that is read one element at a time
stream = all_seasons_df['gp']

k = 100
reservoir = []
for i, element in enumerate(stream):
    if i + 1 <= k:
        # Fill the reservoir with the first k elements
        reservoir.append(element)
    else:
        # Keep element number i+1 with probability k/(i+1)
        probability = k / (i + 1)
        if random.random() < probability:
            # Overwrite one of the k items already selected, chosen uniformly
            reservoir[random.randrange(k)] = element

- Since k = 100 has been set, it is verified that the 100 expected values are obtained.

In [None]:
print(reservoir)

In [None]:
len(reservoir)

In [None]:
a, b, c = min(reservoir), max(reservoir), sum(reservoir)/len(reservoir)
print('Games played per player and season in the sample: minimum {}, maximum {}, average {:.2f}.'.format(a, b, c))
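
The same procedure can be wrapped in a reusable function that accepts any iterable, including a generator whose length is not known in advance. The version below is a minimal sketch of the classic single-pass variant in which one random draw decides both whether an element is kept and which slot it replaces. `reservoir_sample` is a hypothetical helper name, and the squares stream is only an illustration.

```python
import random

def reservoir_sample(stream, k, seed=None):
    """Return k items chosen uniformly from an iterable of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, element in enumerate(stream):
        if i < k:
            reservoir.append(element)   # fill the reservoir first
        else:
            j = rng.randint(0, i)       # uniform position in [0, i]
            if j < k:
                reservoir[j] = element  # happens with probability k/(i+1)
    return reservoir

# The generator is consumed once and never held in memory as a whole
sample = reservoir_sample((x * x for x in range(10_000)), k=5, seed=3)
print(sample)

# A stream shorter than k is returned in full
print(reservoir_sample(range(3), k=5))
```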

In [None]:
all_seasons_df.loc[all_seasons_df.player_name == 'LeBron James', ['player_name', 'gp', 'season']]

In [None]:
all_seasons_df.player_name.isin(['LeBron James'])

- Select rows based on column value:

In [None]:
#To select rows whose column value equals a scalar, some_value, use ==:
all_seasons_df.loc[all_seasons_df['player_name'] == 'LeBron James']

- Select rows whose column value is in an iterable array:

In [None]:
#To select rows whose column value is in an iterable array, which we'll define as array, you can use isin:
array = ['LeBron James', 'Manu Ginobili']
all_seasons_df.loc[all_seasons_df['player_name'].isin(array)]

- Select rows based on multiple column conditions:

In [None]:
#To select a row based on multiple conditions you can use &:
array = ['LeBron James', 'Manu Ginobili']
all_seasons_df.loc[(all_seasons_df['gp'] > 70) & all_seasons_df['player_name'].isin(array)]

- Select rows where column does not equal a value:

In [None]:
#To select rows where a column value does not equal a value, use !=:
all_seasons_df.loc[all_seasons_df['player_name'] != 'LeBron James']

- Select rows whose column value is not in an iterable array:

In [None]:
#To return rows where the column value is not in an iterable array, use ~ in front of the condition:
array = ['LeBron James', 'Manu Ginobili']
all_seasons_df.loc[~all_seasons_df['player_name'].isin(array)]

- Mask values that do not meet a condition (`where` keeps every row and fills the non-matching ones with NaN):

In [None]:
print(all_seasons_df['player_name'].where(all_seasons_df['gp'] > 70))
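
The difference between `where` and boolean indexing with `loc` is easy to miss: the first keeps the original length and fills non-matching positions with NaN, while the second drops them. The snippet below sketches both on a hypothetical three-row frame (`toy`).

```python
import pandas as pd

# Hypothetical toy frame
toy = pd.DataFrame({"player_name": ["A", "B", "C"], "gp": [82, 40, 75]})

# where: same length as the input, NaN where the condition is False
masked = toy["player_name"].where(toy["gp"] > 70)

# loc with a boolean mask: only the matching rows remain
filtered = toy.loc[toy["gp"] > 70, "player_name"]

print(masked)
print(filtered)
```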