## Assignment 3
### Descriptive Statistics - Measures of Central Tendency and Variability
Perform the following operations on any open-source dataset (e.g.,
data.csv):
1. Provide summary statistics (mean, median, minimum,
maximum, standard deviation) for a dataset (age, income, etc.)
with numeric variables grouped by one of the qualitative
(categorical) variables. For example, if your categorical variable
is age group and your quantitative variable is income, then provide
summary statistics of income grouped by age group. Create
a list that contains a numeric value for each response to the
categorical variable.
2. Write a Python program to display some basic statistical details
(percentiles, mean, standard deviation, etc.) of the species
'Iris-setosa', 'Iris-versicolor' and 'Iris-virginica' in the iris.csv
dataset.

Provide the code with outputs and explain everything that you do in
this step.

## Dataset 1

In [75]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [76]:
df = pd.read_csv("nba.csv")


In [77]:
df.columns

Index(['Name', 'Team', 'Number', 'Position', 'Age', 'Height', 'Weight',
       'College', 'Salary'],
      dtype='object')

In [78]:
df.dtypes

Name         object
Team         object
Number      float64
Position     object
Age         float64
Height       object
Weight      float64
College      object
Salary      float64
dtype: object

In [79]:
df.shape

(458, 9)

In [80]:
df.head(40)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,
3,R.J. Hunter,Boston Celtics,28.0,SG,22.0,6-5,185.0,Georgia State,1148640.0
4,Jonas Jerebko,Boston Celtics,8.0,PF,29.0,6-10,231.0,,5000000.0
5,Amir Johnson,Boston Celtics,90.0,PF,29.0,6-9,240.0,,12000000.0
6,Jordan Mickey,Boston Celtics,55.0,PF,21.0,6-8,235.0,LSU,1170960.0
7,Kelly Olynyk,Boston Celtics,41.0,C,25.0,7-0,238.0,Gonzaga,2165160.0
8,Terry Rozier,Boston Celtics,12.0,PG,22.0,6-2,190.0,Louisville,1824360.0
9,Marcus Smart,Boston Celtics,36.0,PG,22.0,6-4,220.0,Oklahoma State,3431040.0


In [81]:
#Checking the missing values or empty cells
df.isnull().sum()

Name         1
Team         1
Number       1
Position     1
Age          1
Height       1
Weight       1
College     85
Salary      12
dtype: int64

In [82]:
print(df["Team"])  

0      Boston Celtics
1      Boston Celtics
2      Boston Celtics
3      Boston Celtics
4      Boston Celtics
            ...      
453         Utah Jazz
454         Utah Jazz
455         Utah Jazz
456         Utah Jazz
457               NaN
Name: Team, Length: 458, dtype: object


In [83]:
# Get the most frequent team
team_mode = df["Team"].mode()[0]
# Preview the Team column with missing values filled by the most frequent team
# (fillna returns a new Series; df itself is not modified here)
df["Team"].fillna(team_mode)


0            Boston Celtics
1            Boston Celtics
2            Boston Celtics
3            Boston Celtics
4            Boston Celtics
               ...         
453               Utah Jazz
454               Utah Jazz
455               Utah Jazz
456               Utah Jazz
457    New Orleans Pelicans
Name: Team, Length: 458, dtype: object
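
Because `fillna` returns a new Series, the filled values only persist if the result is assigned back to the column. The sketch below illustrates this on a small hypothetical DataFrame (`toy` and its values are made up for illustration, not taken from nba.csv).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the "Team" column
toy = pd.DataFrame({"Team": ["Utah Jazz", "Utah Jazz", "Boston Celtics", np.nan]})

team_mode = toy["Team"].mode()[0]  # most frequent value in the column

# fillna returns a new Series; the original column still has its NaN
filled = toy["Team"].fillna(team_mode)
missing_before = int(toy["Team"].isnull().sum())

# Assign the result back to make the change persistent
toy["Team"] = toy["Team"].fillna(team_mode)
missing_after = int(toy["Team"].isnull().sum())

print(missing_before, missing_after)
```

In the notebook above, the row with the missing team is also missing every other field, so it is removed later by `dropna` either way.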

In [86]:
#Fill missing Salary values with the mean and missing College values with the mode
salary_mean = df["Salary"].mean(axis = 0)
df["Salary"] = df["Salary"].fillna(salary_mean) 

college_mode = df["College"].mode()[0]  
df["College"] = df["College"].fillna(college_mode) 



In [90]:
#Drop other rows having missing values
df.dropna(inplace=True)

In [91]:
df.isnull().sum()

Name        0
Team        0
Number      0
Position    0
Age         0
Height      0
Weight      0
College     0
Salary      0
dtype: int64

In [92]:
#Checking minimum and maximum age in the dataset
max_age = df["Age"].max()
print("Maximum age in the dataset is: ",max_age)
min_age = df["Age"].min()
print("Minimum age in the dataset is: ",min_age)


Maximum age in the dataset is:  40.0
Minimum age in the dataset is:  19.0


In [95]:
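#Bin the Age column into age groups
#Note: pd.cut uses right-closed intervals by default, so these bins are
#(19, 25], (25, 31], (31, 36], (36, 40]: age 25 is labelled "19-24",
#age 31 is labelled "25-30", and age 19 falls outside every bin (NaN)
bins = [19,25,31,36,40]
labels = ["19-24", "25-30", "31-35", "36-40"]
df["AgeGroup"] = pd.cut(df["Age"], bins=bins, labels=labels)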
# Group the DataFrame by the "AgeGroup" column
age_groups = df.groupby("AgeGroup")
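
One possible way to make the bin edges agree with labels such as "19-24" is to use left-closed intervals via `right=False` and extend the last edge to 41. The sketch below compares both variants on a hypothetical list of ages; it is an illustration of `pd.cut` behaviour, not the binning used for the outputs in this notebook.

```python
import pandas as pd

ages = pd.Series([19, 24, 25, 30, 31, 35, 36, 40])  # hypothetical ages
labels = ["19-24", "25-30", "31-35", "36-40"]

# Default (right=True): bins are (19, 25], (25, 31], (31, 36], (36, 40]
default_cut = pd.cut(ages, bins=[19, 25, 31, 36, 40], labels=labels)

# right=False: bins are [19, 25), [25, 31), [31, 36), [36, 41)
left_closed = pd.cut(ages, bins=[19, 25, 31, 36, 41], labels=labels, right=False)

comparison = pd.DataFrame(
    {"age": ages, "default": default_cut, "left_closed": left_closed}
)
print(comparison)
```

With the left-closed variant, age 19 is kept in the first group and boundary ages such as 25 and 31 land in the group whose label contains them.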



In [96]:
#Display salary by age-group
age_groups["Salary"].describe()

AgeGroup,count,mean,std,min,25%,50%,75%,max
19-24,197.0,3041840.0,3552554.0,30888.0,947276.0,1662360.0,3533333.0,16407501.0
25-30,189.0,6635459.0,5825556.0,55722.0,1500000.0,4842684.0,10151612.0,22970500.0
31-35,56.0,5113016.0,5283815.0,200600.0,1323709.25,3646250.0,6350000.0,22875000.0
36-40,13.0,5351744.0,6508388.0,222888.0,947726.0,4088019.0,5250000.0,25000000.0
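
`describe()` reports the median only as the `50%` column. If exactly the statistics named in the assignment (mean, median, minimum, maximum, standard deviation) are wanted, `agg` with a list of function names is one option. The sketch below uses a small hypothetical DataFrame (`toy`) rather than nba.csv.

```python
import pandas as pd

# Hypothetical toy data: two age groups with made-up salaries
toy = pd.DataFrame({
    "AgeGroup": ["19-24", "19-24", "19-24", "25-30", "25-30"],
    "Salary": [1.0, 2.0, 6.0, 4.0, 8.0],
})

# One row per group, one column per requested statistic
summary = toy.groupby("AgeGroup")["Salary"].agg(
    ["mean", "median", "min", "max", "std"]
)
print(summary)
```

Note that `std` here is the sample standard deviation (ddof=1), the same convention `describe()` uses.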


In [98]:
#Group the dataframe by height
height_groups = df.groupby("Height")
#Display summary statistics of Salary by height-group column in the dataset
height_groups["Salary"].describe()

Height,count,mean,std,min,25%,50%,75%,max
5-11,3.0,589155.3,792662.7,55722.0,133733.0,211744.0,855872.0,1500000.0
5-9,1.0,6912869.0,,6912869.0,6912869.0,6912869.0,6912869.0,6912869.0
6-0,10.0,5784075.0,6337144.0,947276.0,2437500.0,3934473.5,4846419.0,21468695.0
6-1,16.0,5217919.0,4286013.0,700902.0,1646160.0,3402626.5,8633373.0,13500000.0
6-10,47.0,5185375.0,5063120.0,222888.0,1054584.5,3815000.0,7025766.0,19689000.0
6-11,40.0,6544397.0,6906416.0,245177.0,1362370.0,3107656.0,11438040.0,22359364.0
6-2,16.0,3523777.0,3631376.0,525093.0,947276.0,1553220.0,4882013.0,13437500.0
6-3,33.0,5821784.0,5668225.0,189455.0,1662360.0,4053446.0,8000000.0,20093064.0
6-4,29.0,4646163.0,5275308.0,134215.0,1015421.0,2525160.0,5192520.0,20000000.0
6-5,32.0,4391786.0,4114296.0,55722.0,1160040.0,3129420.0,6015152.0,16407500.0
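
Height is stored as a string in feet-inches form, so the groups above sort alphabetically ("6-10" comes before "6-2"). A minimal sketch of converting such strings to total inches is shown below; `height_to_inches` is a hypothetical helper and the sample values are made up.

```python
import pandas as pd

heights = pd.Series(["6-2", "6-10", "5-11", "7-0"])  # hypothetical sample

def height_to_inches(height):
    """Convert a 'feet-inches' string such as '6-10' to total inches."""
    feet, inches = height.split("-")
    return int(feet) * 12 + int(inches)

inches = heights.map(height_to_inches)

string_order = sorted(heights)                  # alphabetical order
numeric_order = inches.sort_values().tolist()   # true height order

print(string_order)
print(numeric_order)
```

Grouping on the numeric column (or on bins of it) would give groups ordered by actual height.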


In [109]:
#Grouping the dataframe by Team categorical variable
teamgroup = df.groupby('Team')
#Display salary statistics grouped by teamgroup variable
teamgroup["Salary"].describe()

Team,count,mean,std,min,25%,50%,75%,max
Atlanta Hawks,15.0,4860197.0,5194508.0,525093.0,1152260.0,2854940.0,6873240.0,18671659.0
Boston Celtics,15.0,4225583.0,3036396.0,1148640.0,1994760.0,3425510.0,5898058.0,12000000.0
Brooklyn Nets,15.0,3501898.0,5317817.0,134215.0,947276.0,1335480.0,2512675.0,19689000.0
Charlotte Hornets,15.0,5222728.0,4538601.0,189455.0,1543138.0,4204200.0,6665702.0,13500000.0
Chicago Bulls,15.0,5785559.0,6251088.0,525093.0,1203290.5,2380440.0,7974380.0,20093064.0
Cleveland Cavaliers,15.0,7455425.0,7484116.0,111196.0,1211638.0,4950000.0,11624820.0,22970500.0
Dallas Mavericks,15.0,4746582.0,5030279.0,525093.0,1185783.0,3950313.0,5289487.0,16407500.0
Denver Nuggets,15.0,4330974.0,4165468.0,258489.0,1647099.5,3000000.0,4593842.0,14000000.0
Detroit Pistons,15.0,4477884.0,4668478.0,111444.0,1711452.5,2891760.0,5635000.0,16000000.0
Golden State Warriors,15.0,5924600.0,5664282.0,289755.0,1201462.0,3815000.0,11540620.0,15501000.0


In [110]:
#Get all rows for a particular team
print("The Chicago Bulls players are:")
teamgroup.get_group('Chicago Bulls')

The Chicago Bulls players are:


Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary,AgeGroup,Num_position
151,Cameron Bairstow,Chicago Bulls,41.0,PF,25.0,6-9,250.0,New Mexico,845059.0,19-24,4
152,Aaron Brooks,Chicago Bulls,0.0,PG,31.0,6-0,161.0,Oregon,2250000.0,25-30,1
153,Jimmy Butler,Chicago Bulls,21.0,SG,26.0,6-7,220.0,Marquette,16407500.0,25-30,2
154,Mike Dunleavy,Chicago Bulls,34.0,SG,35.0,6-9,230.0,Duke,4500000.0,31-35,2
155,Cristiano Felicio,Chicago Bulls,6.0,PF,23.0,6-10,275.0,Kentucky,525093.0,19-24,4
156,Pau Gasol,Chicago Bulls,16.0,C,35.0,7-0,250.0,Kentucky,7448760.0,31-35,5
157,Taj Gibson,Chicago Bulls,22.0,PF,30.0,6-9,225.0,USC,8500000.0,25-30,4
158,Justin Holiday,Chicago Bulls,7.0,SG,27.0,6-6,185.0,Washington,947276.0,25-30,2
159,Doug McDermott,Chicago Bulls,3.0,SF,24.0,6-8,225.0,Creighton,2380440.0,19-24,3
160,Nikola Mirotic,Chicago Bulls,44.0,PF,25.0,6-10,220.0,Kentucky,5543725.0,19-24,4


In [111]:
#Creating a list that contains a numeric value for each response to the categorical variable like Position,Team,college name etc.
#Get the unique values in the position column
unique_pos = df["Position"].unique()
print("The unique position are:",unique_pos)

#Mapping the Position categorical values to numerical values
map_position = {"PG": 1, "SG": 2, "SF": 3, "PF": 4, "C": 5}
df["Num_position"] = df["Position"].map(map_position)

#Convert the mapped numeric position column to list containing numeric values of positions
map_list = df["Num_position"].tolist()

print("The list of positions are:",map_list)

The unique position are: ['PG' 'SF' 'SG' 'PF' 'C']
The list of positions are: [1, 3, 2, 2, 4, 4, 4, 5, 1, 1, 5, 1, 2, 2, 5, 2, 2, 2, 2, 1, 2, 2, 1, 5, 4, 4, 4, 5, 1, 4, 2, 4, 3, 3, 1, 3, 2, 1, 5, 4, 4, 5, 3, 2, 4, 2, 4, 1, 3, 5, 3, 4, 4, 1, 1, 4, 5, 1, 2, 2, 4, 5, 3, 3, 2, 4, 1, 1, 5, 4, 2, 3, 4, 4, 5, 1, 2, 3, 5, 2, 1, 5, 4, 3, 1, 3, 3, 3, 5, 2, 4, 5, 4, 2, 3, 3, 4, 3, 5, 4, 1, 3, 1, 2, 1, 2, 4, 5, 3, 3, 1, 5, 1, 4, 4, 4, 1, 5, 2, 3, 3, 1, 2, 3, 5, 2, 2, 1, 5, 4, 1, 1, 4, 3, 3, 5, 3, 2, 2, 3, 3, 5, 1, 5, 2, 4, 3, 5, 2, 4, 1, 4, 1, 2, 2, 4, 5, 4, 2, 3, 4, 2, 5, 4, 1, 3, 1, 4, 1, 3, 3, 2, 2, 5, 4, 2, 5, 2, 2, 5, 1, 5, 5, 1, 1, 3, 2, 1, 5, 3, 3, 1, 3, 2, 4, 4, 4, 4, 2, 3, 1, 5, 3, 1, 5, 3, 2, 1, 4, 4, 1, 3, 1, 1, 2, 1, 4, 3, 2, 2, 5, 3, 4, 4, 5, 1, 2, 2, 1, 3, 1, 1, 4, 2, 5, 5, 4, 5, 3, 4, 4, 1, 3, 3, 1, 2, 4, 3, 1, 2, 4, 5, 4, 2, 4, 5, 2, 2, 2, 4, 3, 2, 1, 1, 1, 5, 4, 3, 4, 1, 1, 4, 3, 4, 4, 5, 4, 5, 3, 1, 4, 4, 2, 1, 3, 2, 1, 3, 2, 2, 1, 2, 5, 3, 4, 3, 5, 5, 5, 2, 2, 3, 5, 2, 1, 1, 1, 
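
As an alternative to a hand-written dictionary, the same numeric coding can be sketched with an ordered `pd.Categorical`. The category order below mirrors the mapping used above; the sample values are hypothetical.

```python
import pandas as pd

positions = pd.Series(["PG", "SF", "SG", "PF", "C", "PG"])  # hypothetical sample

# Category order chosen to match the dictionary mapping PG=1 ... C=5
order = ["PG", "SG", "SF", "PF", "C"]
categorical = pd.Categorical(positions, categories=order, ordered=True)

# .codes starts at 0, so add 1; a value not listed in `order` would get code -1 (0 after the shift)
num_list = (categorical.codes + 1).tolist()

print(num_list)
```

The dictionary approach makes the mapping explicit, while the categorical approach also records the ordering on the column itself.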

## Dataset 2 (Iris.csv)

In [149]:
from sklearn import datasets
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt


In [150]:
#It contains 150 samples of 3 iris species with 4 features: sepal length, sepal width, petal length, petal width. 
iris = datasets.load_iris()

In [151]:
# Name the columns of the dataset
df2 = pd.DataFrame(data = iris.data,columns = iris.feature_names)
#Adding the species feature column in the dataframe
df2["Species"] = iris.target_names[iris.target]
#Printing the data
df2.head()


Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),Species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


In [152]:
df2.dtypes

sepal length (cm)    float64
sepal width (cm)     float64
petal length (cm)    float64
petal width (cm)     float64
Species               object
dtype: object

In [153]:
df2.shape

(150, 5)

In [154]:
#Renaming the columns of the dataframe
df2 = df2.rename(columns={
    "sepal length (cm)": "sepal_length",
    "sepal width (cm)": "sepal_width",
    "petal length (cm)": "petal_length",
    "petal width (cm)": "petal_width",
})
df2.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,Species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


In [155]:
#Checking the missing values in the data
df2.isnull().sum()

sepal_length    0
sepal_width     0
petal_length    0
petal_width     0
Species         0
dtype: int64

### Method-1: Statistical values of various species

In [156]:
#Group the data by Species
species_group = df2.groupby("Species")

In [157]:
#Display summary statistics of one feature grouped by Species, e.g. sepal_length for every species
species_group["sepal_length"].describe()

Species,count,mean,std,min,25%,50%,75%,max
setosa,50.0,5.006,0.35249,4.3,4.8,5.0,5.2,5.8
versicolor,50.0,5.936,0.516171,4.9,5.6,5.9,6.3,7.0
virginica,50.0,6.588,0.63588,4.9,6.225,6.5,6.9,7.9


In [158]:
species_group["petal_length"].describe()

Species,count,mean,std,min,25%,50%,75%,max
setosa,50.0,1.462,0.173664,1.0,1.4,1.5,1.575,1.9
versicolor,50.0,4.26,0.469911,3.0,4.0,4.35,4.6,5.1
virginica,50.0,5.552,0.551895,4.5,5.1,5.55,5.875,6.9


In [165]:
#Getting the median of a feature
species_group["petal_length"].median()

Species
setosa        1.50
versicolor    4.35
virginica     5.55
Name: petal_length, dtype: float64
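
`describe()` only reports the 25th, 50th and 75th percentiles. Other percentiles per species can be sketched with `quantile` on the grouped column. The example below uses a small hypothetical DataFrame (`toy`) so that it runs on its own; the percentile levels are arbitrary choices.

```python
import pandas as pd

# Hypothetical toy data with two species
toy = pd.DataFrame({
    "Species": ["setosa"] * 5 + ["virginica"] * 5,
    "petal_length": [1.0, 1.2, 1.4, 1.6, 1.8, 5.0, 5.2, 5.4, 5.6, 5.8],
})

# 10th, 50th and 90th percentiles of petal_length for each species
pct = toy.groupby("Species")["petal_length"].quantile([0.10, 0.50, 0.90])

# Reshape so each species is a row and each percentile a column
pct_table = pct.unstack()
print(pct_table)
```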

### Method-2: Statistical values of various species

In [159]:
#Displaying the stat values of sepal_length of setosa species
df2[ df2["Species"] == "setosa" ]["sepal_length"].describe()

count    50.00000
mean      5.00600
std       0.35249
min       4.30000
25%       4.80000
50%       5.00000
75%       5.20000
max       5.80000
Name: sepal_length, dtype: float64

In [160]:
#Displaying the stat values of sepal_length of versicolor species
df2[ df2["Species"] == "versicolor" ]["sepal_length"].describe()

count    50.000000
mean      5.936000
std       0.516171
min       4.900000
25%       5.600000
50%       5.900000
75%       6.300000
max       7.000000
Name: sepal_length, dtype: float64

In [161]:
#Displaying the stat values of sepal_length of virginica species
df2[ df2["Species"] == "virginica" ]["sepal_length"].describe()

count    50.00000
mean      6.58800
std       0.63588
min       4.90000
25%       6.22500
50%       6.50000
75%       6.90000
max       7.90000
Name: sepal_length, dtype: float64
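
The three cells above repeat the same filter for each species. One way to avoid the repetition is to iterate over the groups, as in the following sketch on a hypothetical toy DataFrame (`toy` and its values are illustrative only).

```python
import pandas as pd

# Hypothetical toy data with the same column names as df2
toy = pd.DataFrame({
    "Species": ["setosa", "setosa", "versicolor", "versicolor",
                "virginica", "virginica"],
    "sepal_length": [5.0, 5.2, 5.8, 6.0, 6.4, 6.8],
})

# Build one describe() result per species
stats_by_species = {
    name: group["sepal_length"].describe()
    for name, group in toy.groupby("Species")
}

for name, stats in stats_by_species.items():
    print(name)
    print(stats)
```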

In [170]:
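#Getting the mode of Species
#Note: each species has 50 samples, so mode() returns all three species;
#[0] simply picks the first one in sorted order
mode_ = df2["Species"].mode()[0]
print("The most frequent species in the dataset is: ",mode_)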

The most frequent species in the dataset is:  setosa
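
When every category occurs equally often, `mode()` returns all of them, so taking `[0]` alone can be misleading. A quick check with `value_counts` is sketched below on a hypothetical balanced Series.

```python
import pandas as pd

# Hypothetical balanced data: three categories, three samples each
species = pd.Series(["setosa"] * 3 + ["versicolor"] * 3 + ["virginica"] * 3)

counts = species.value_counts()  # frequency of each category
modes = species.mode()           # all categories tied for the highest count

print(counts)
print(modes.tolist())
```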
