## A3


Perform the following operations on any open-source dataset (e.g. data.csv)

A) Provide summary statistics (mean, median, minimum, maximum, standard
deviation) for a dataset with numeric variables (age, income, etc.), grouped by
one of the qualitative (categorical) variables. For example, if your categorical
variable is age group and your quantitative variable is income, then provide
summary statistics of income grouped by age group. Create a list that
contains a numeric value for each response to the categorical variable.

B) Write a Python program to display some basic statistical details, such as percentiles,
mean, and standard deviation, for the species ‘Iris-setosa’, ‘Iris-versicolor’,
and ‘Iris-virginica’ of the iris.csv dataset.

Provide the codes with outputs and explain everything that you do in this step.

**1) Import Required Libraries**

pandas → data handling

numpy → numerical operations

matplotlib & seaborn → graphs/plots

In [22]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

**2) Load the Dataset**

In [23]:
# Load the NBA dataset
df1 = pd.read_csv("nba.csv")

In [24]:
# Display first 5 rows
df1.head()

| | Name | Team | Number | Position | Age | Height | Weight | College | Salary |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Avery Bradley | Boston Celtics | 0.0 | PG | 25.0 | 6-2 | 180.0 | Texas | 7730337.0 |
| 1 | Jae Crowder | Boston Celtics | 99.0 | SF | 25.0 | 6-6 | 235.0 | Marquette | 6796117.0 |
| 2 | John Holland | Boston Celtics | 30.0 | SG | 27.0 | 6-5 | 205.0 | Boston University | NaN |
| 3 | R.J. Hunter | Boston Celtics | 28.0 | SG | 22.0 | 6-5 | 185.0 | Georgia State | 1148640.0 |
| 4 | Jonas Jerebko | Boston Celtics | 8.0 | PF | 29.0 | 6-10 | 231.0 | NaN | 5000000.0 |


**3) Understand Dataset Structure**

In [25]:
# Dimensions of dataset
df1.shape

(458, 9)

In [26]:
# Datatypes of each column
df1.dtypes

Name         object
Team         object
Number      float64
Position     object
Age         float64
Height       object
Weight      float64
College      object
Salary      float64
dtype: object

In [27]:
# Display info of dataset
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 458 entries, 0 to 457
Data columns (total 9 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Name      457 non-null    object 
 1   Team      457 non-null    object 
 2   Number    457 non-null    float64
 3   Position  457 non-null    object 
 4   Age       457 non-null    float64
 5   Height    457 non-null    object 
 6   Weight    457 non-null    float64
 7   College   373 non-null    object 
 8   Salary    446 non-null    float64
dtypes: float64(4), object(5)
memory usage: 32.3+ KB


In [28]:
# Display statistical summary
df1.describe()

| | Number | Age | Weight | Salary |
|---|---|---|---|---|
| count | 457.0 | 457.0 | 457.0 | 446.0 |
| mean | 17.678337 | 26.938731 | 221.522976 | 4842684.0 |
| std | 15.96609 | 4.404016 | 26.368343 | 5229238.0 |
| min | 0.0 | 19.0 | 161.0 | 30888.0 |
| 25% | 5.0 | 24.0 | 200.0 | 1044792.0 |
| 50% | 13.0 | 26.0 | 220.0 | 2839073.0 |
| 75% | 25.0 | 30.0 | 240.0 | 6500000.0 |
| max | 99.0 | 40.0 | 307.0 | 25000000.0 |


**4) Check for missing values**

In [29]:
# Check for missing values
df1.isnull().sum()

Name         1
Team         1
Number       1
Position     1
Age          1
Height       1
Weight       1
College     85
Salary      12
dtype: int64

In [30]:
# Percentage of missing values
df1.isnull().mean() * 100

Name         0.218341
Team         0.218341
Number       0.218341
Position     0.218341
Age          0.218341
Height       0.218341
Weight       0.218341
College     18.558952
Salary       2.620087
dtype: float64

**5) Handle Missing Values**

In [31]:
# Replace missing values in the numeric columns with the column mean
df1['Number'] = df1['Number'].fillna(df1['Number'].mean())
df1['Age'] = df1['Age'].fillna(df1['Age'].mean())
df1['Weight'] = df1['Weight'].fillna(df1['Weight'].mean())
df1['Salary'] = df1['Salary'].fillna(df1['Salary'].mean())

In [32]:
# Check for missing values
df1.isnull().sum()

Name         1
Team         1
Number       0
Position     1
Age          0
Height       1
Weight       0
College     85
Salary       0
dtype: int64
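
Every column reports at least one missing value, and `info()` shows 457 non-null entries out of 458 for most columns, which suggests a single row that is empty throughout. The sketch below is one possible alternative to the mean fill used above, shown on a small made-up frame rather than on nba.csv: it drops rows where every column is missing, then fills the remaining numeric gaps with the column median, which is less sensitive to a few very large salaries. The `toy` frame and its values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Small made-up frame standing in for nba.csv (values are illustrative only)
toy = pd.DataFrame({
    "Name": ["A", "B", "C", None],
    "Age": [25.0, 22.0, 30.0, np.nan],
    "Salary": [1_000_000.0, np.nan, 9_000_000.0, np.nan],
})

# Drop rows in which every column is missing
cleaned = toy.dropna(how="all").copy()

# Fill the remaining numeric gaps with each column's median
numeric_cols = cleaned.select_dtypes(include="number").columns
cleaned[numeric_cols] = cleaned[numeric_cols].fillna(cleaned[numeric_cols].median())

remaining_missing = int(cleaned.isnull().sum().sum())
print(cleaned)
print("Missing values left:", remaining_missing)
```

Whether the mean or the median is the better fill depends on the column; for a strongly skewed variable such as salary the two can differ considerably.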

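**6) Summary Statistics Grouped by a Categorical Variable**

In [55]:
# Group the players by the categorical Height column
height_groups = df1.groupby("Height")
# head() on a GroupBy object returns the first 5 rows of every group
height_groups.head()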

| | Name | Team | Number | Position | Age | Height | Weight | College | Salary |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Avery Bradley | Boston Celtics | 0.0 | PG | 25.0 | 6-2 | 180.0 | Texas | 7.730337e+06 |
| 1 | Jae Crowder | Boston Celtics | 99.0 | SF | 25.0 | 6-6 | 235.0 | Marquette | 6.796117e+06 |
| 2 | John Holland | Boston Celtics | 30.0 | SG | 27.0 | 6-5 | 205.0 | Boston University | 4.842684e+06 |
| 3 | R.J. Hunter | Boston Celtics | 28.0 | SG | 22.0 | 6-5 | 185.0 | Georgia State | 1.148640e+06 |
| 4 | Jonas Jerebko | Boston Celtics | 8.0 | PF | 29.0 | 6-10 | 231.0 | NaN | 5.000000e+06 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 275 | Alexis Ajinca | New Orleans Pelicans | 42.0 | C | 28.0 | 7-2 | 248.0 | NaN | 4.389607e+06 |
| 302 | Boban Marjanovic | San Antonio Spurs | 40.0 | C | 27.0 | 7-3 | 290.0 | NaN | 1.200000e+06 |
| 322 | Walter Tavares | Atlanta Hawks | 22.0 | C | 24.0 | 7-3 | 260.0 | NaN | 1.000000e+06 |
| 329 | Spencer Hawes | Charlotte Hornets | 0.0 | PF | 28.0 | 7-1 | 245.0 | Washington | 6.110034e+06 |


In [56]:
# Summary statistics of Salary within each height group (the 50% column is the median)
height_groups["Salary"].describe()

| Height | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| 5-11 | 3.0 | 589155.3 | 792662.7 | 55722.0 | 133733.0 | 211744.0 | 855872.0 | 1500000.0 |
| 5-9 | 1.0 | 6912869.0 | NaN | 6912869.0 | 6912869.0 | 6912869.0 | 6912869.0 | 6912869.0 |
| 6-0 | 10.0 | 5784075.0 | 6337144.0 | 947276.0 | 2437500.0 | 3934473.5 | 4846419.0 | 21468695.0 |
| 6-1 | 16.0 | 5217919.0 | 4286013.0 | 700902.0 | 1646160.0 | 3402626.5 | 8633373.0 | 13500000.0 |
| 6-10 | 47.0 | 5185375.0 | 5063120.0 | 222888.0 | 1054584.5 | 3815000.0 | 7025766.0 | 19689000.0 |
| 6-11 | 40.0 | 6544397.0 | 6906416.0 | 245177.0 | 1362370.0 | 3107656.0 | 11438040.0 | 22359364.0 |
| 6-2 | 16.0 | 3523777.0 | 3631376.0 | 525093.0 | 947276.0 | 1553220.0 | 4882013.0 | 13437500.0 |
| 6-3 | 33.0 | 5821784.0 | 5668225.0 | 189455.0 | 1662360.0 | 4053446.0 | 8000000.0 | 20093064.0 |
| 6-4 | 29.0 | 4646163.0 | 5275308.0 | 134215.0 | 1015421.0 | 2525160.0 | 5192520.0 | 20000000.0 |
| 6-5 | 32.0 | 4391786.0 | 4114296.0 | 55722.0 | 1160040.0 | 3129420.0 | 6015152.0 | 16407500.0 |
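
Because Height is stored as text, the groups above are sorted as strings, which is why "6-10" and "6-11" appear before "6-2". The sketch below shows one way to get groups in true height order by converting the "feet-inches" strings to total inches. The helper `height_to_inches`, the column name `HeightInches`, and the toy values are all hypothetical and used only for illustration.

```python
import pandas as pd

def height_to_inches(height):
    """Convert a 'feet-inches' string such as '6-10' to total inches."""
    feet, inches = height.split("-")
    return int(feet) * 12 + int(inches)

# Small made-up frame (values are illustrative only)
toy = pd.DataFrame({
    "Height": ["6-2", "6-10", "5-11", "6-2"],
    "Salary": [1.0, 2.0, 3.0, 5.0],
})

# na_action="ignore" leaves missing heights as NaN instead of calling the helper on them
toy["HeightInches"] = toy["Height"].map(height_to_inches, na_action="ignore")

# Grouping on the numeric column orders the groups by actual height
by_height = toy.groupby("HeightInches")["Salary"].mean()
print(by_height)
```

The original Height strings can be kept alongside the numeric column, so the tables can still be labelled in the familiar feet-inches form.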


In [57]:
# Oldest age in the dataset, used to choose the upper bin edge
df1["Age"].max()

40.0

In [58]:
# Youngest age in the dataset, used to choose the lower bin edge
df1["Age"].min()

19.0

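In [60]:
# Bin Age into four groups. pd.cut uses right-closed intervals by default,
# so the actual bins are (19, 25], (25, 31], (31, 36], (36, 40].
# The labels are therefore approximate (age 25 lands in "19-24"),
# and an age of exactly 19 falls outside the first bin and becomes NaN.
bins = [19,25,31,36,40]
labels = ["19-24", "25-30", "31-35", "36-40"]
df1["AgeGroup"] = pd.cut(df1["Age"], bins=bins, labels=labels)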

In [61]:
# Number of players in each age group
df1["AgeGroup"].value_counts()

AgeGroup
19-24    197
25-30    190
31-35     56
36-40     13
Name: count, dtype: int64
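
The counts above sum to 456 rather than 458, which is consistent with `pd.cut` using right-closed intervals by default: an age of exactly 19 is not inside (19, 25]. A minimal sketch of one way to make the bins match their labels is shown below, using `right=False` and an upper edge of 41. The `ages` series is made up for illustration and is not taken from nba.csv.

```python
import pandas as pd

# Made-up ages covering every bin boundary
ages = pd.Series([19, 24, 25, 30, 31, 35, 36, 40])

bins = [19, 25, 31, 36, 41]
labels = ["19-24", "25-30", "31-35", "36-40"]

# right=False makes each interval closed on the left: [19, 25), [25, 31), [31, 36), [36, 41)
age_group = pd.cut(ages, bins=bins, labels=labels, right=False)
print(age_group.tolist())
# ['19-24', '19-24', '25-30', '25-30', '31-35', '31-35', '36-40', '36-40']
```

An equivalent option is to keep the default and pass `include_lowest=True`, which only fixes the missing 19-year-olds and leaves the label offset at the other edges.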

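In [64]:
# Group by the new AgeGroup column; observed=False states the current default explicitly
age_groups = df1.groupby("AgeGroup", observed=False)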


In [66]:
# Summary statistics of Salary within each age group (the 50% column is the median)
age_groups["Salary"].describe()

| AgeGroup | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| 19-24 | 197.0 | 3041840.0 | 3552554.0 | 30888.0 | 947276.0 | 1662360.0 | 3533333.0 | 16407501.0 |
| 25-30 | 190.0 | 6626023.0 | 5811580.0 | 55722.0 | 1534210.5 | 4842684.0 | 10126209.0 | 22970500.0 |
| 31-35 | 56.0 | 5113016.0 | 5283815.0 | 200600.0 | 1323709.25 | 3646250.0 | 6350000.0 | 22875000.0 |
| 36-40 | 13.0 | 5351744.0 | 6508388.0 | 222888.0 | 947726.0 | 4088019.0 | 5250000.0 | 25000000.0 |
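
Part A asks specifically for the mean, median, minimum, maximum and standard deviation, and for a list holding a numeric value for each response of the categorical variable. `describe()` covers the statistics (its 50% column is the median), but the sketch below shows a possible way to request exactly those five with `agg`, and to derive the numeric list from the category codes. The `toy` frame and its values are assumptions for illustration, not data from nba.csv.

```python
import pandas as pd

# Small made-up frame (values are illustrative only)
toy = pd.DataFrame({
    "AgeGroup": pd.Categorical(
        ["19-24", "25-30", "19-24", "31-35", "25-30"],
        categories=["19-24", "25-30", "31-35", "36-40"],
        ordered=True,
    ),
    "Salary": [1.0, 4.0, 3.0, 10.0, 6.0],
})

# Exactly the five statistics named in the task, one row per observed group
summary = (
    toy.groupby("AgeGroup", observed=True)["Salary"]
    .agg(["mean", "median", "min", "max", "std"])
)
print(summary)

# One numeric code per row: 0 for "19-24", 1 for "25-30", and so on
group_codes = toy["AgeGroup"].cat.codes.tolist()
print(group_codes)  # [0, 1, 0, 2, 1]
```

With `observed=True`, categories that have no rows (here "36-40") are left out of the summary; a group with a single row reports NaN for the standard deviation.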
