# Applying Statistical Methods to Datasets

## Descriptive Statistics

### Installing required libraries

In [1]:
!pip install scipy



In [2]:
!pip install scikit-learn statsmodels



### Descriptive statistics

Descriptive statistics help us summarize and describe the main features of a dataset. This includes measures of central tendency, variability, and the shape of the data distribution.
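As a small illustration of these three kinds of measures, the sketch below uses a made-up pandas Series (the values are hypothetical, not taken from any dataset in this notebook). It shows how a single large value pulls the mean above the median and produces positive skewness.

```python
import pandas as pd

# Hypothetical right-skewed values, for illustration only
values = pd.Series([22, 24, 25, 25, 27, 28, 30, 60])

summary = {
    "mean": values.mean(),      # central tendency, pulled up by the large value
    "median": values.median(),  # central tendency, robust to the large value
    "std": values.std(),        # variability (sample standard deviation, ddof=1)
    "skew": values.skew(),      # shape: positive means a longer right tail
}
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

Comparing the mean with the median is a quick first check for skew before looking at the skewness value itself.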

#### Inferential Statistics in Python

Inferential statistics involve drawing conclusions about a population based on a sample. Here are some key techniques:

**Hypothesis Testing:** Used to determine whether there is enough evidence to reject a null hypothesis.

**Confidence Intervals:** Provide a range of values that likely contain the population parameter.

**Regression Analysis:** Examines relationships between variables.
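Of the three techniques, confidence intervals are the one not worked through later in this notebook, so here is a minimal sketch. It assumes a synthetic, normally distributed sample (generated with NumPy, not drawn from the diabetes data) and builds a 95% interval for the mean from the t distribution using `scipy.stats`.

```python
import numpy as np
from scipy import stats

# Synthetic sample, for illustration only (not drawn from the diabetes data)
rng = np.random.default_rng(0)
sample = rng.normal(loc=28.0, scale=6.0, size=200)

n = sample.size
sample_mean = sample.mean()
standard_error = stats.sem(sample)  # s / sqrt(n), with ddof=1

# 95% confidence interval for the mean, based on the t distribution
ci_low, ci_high = stats.t.interval(0.95, df=n - 1, loc=sample_mean, scale=standard_error)
print(f"mean = {sample_mean:.2f}, 95% CI = ({ci_low:.2f}, {ci_high:.2f})")
```

The interval is centred on the sample mean and narrows as the sample size grows, because the standard error shrinks with the square root of `n`.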

### Exercises for Applying Statistical Methods

#### Analysing a Health-Related Dataset

#### 1. Loading the diabetes dataset built into scikit-learn

In [3]:
# Import packages
import pandas as pd
from sklearn.datasets import load_diabetes

# Load the dataset into a DataFrame
diabetes = load_diabetes()
df = pd.DataFrame(data=diabetes.data, columns=diabetes.feature_names)
df

Unnamed: 0,age,sex,bmi,bp,s1,s2,s3,s4,s5,s6
0,0.038076,0.050680,0.061696,0.021872,-0.044223,-0.034821,-0.043401,-0.002592,0.019907,-0.017646
1,-0.001882,-0.044642,-0.051474,-0.026328,-0.008449,-0.019163,0.074412,-0.039493,-0.068332,-0.092204
2,0.085299,0.050680,0.044451,-0.005670,-0.045599,-0.034194,-0.032356,-0.002592,0.002861,-0.025930
3,-0.089063,-0.044642,-0.011595,-0.036656,0.012191,0.024991,-0.036038,0.034309,0.022688,-0.009362
4,0.005383,-0.044642,-0.036385,0.021872,0.003935,0.015596,0.008142,-0.002592,-0.031988,-0.046641
...,...,...,...,...,...,...,...,...,...,...
437,0.041708,0.050680,0.019662,0.059744,-0.005697,-0.002566,-0.028674,-0.002592,0.031193,0.007207
438,-0.005515,0.050680,-0.015906,-0.067642,0.049341,0.079165,-0.028674,0.034309,-0.018114,0.044485
439,0.041708,0.050680,-0.015906,0.017293,-0.037344,-0.013840,-0.024993,-0.011080,-0.046883,0.015491
440,-0.045472,-0.044642,0.039062,0.001215,0.016318,0.015283,-0.028674,0.026560,0.044529,-0.025930


#### Or downloading the data from Kaggle and loading it into a DataFrame

In [4]:
df=pd.read_csv('diabetes_012_health_indicators_BRFSS2015.csv')
df.head(5)

Unnamed: 0,Diabetes_012,HighBP,HighChol,CholCheck,BMI,Smoker,Stroke,HeartDiseaseorAttack,PhysActivity,Fruits,...,AnyHealthcare,NoDocbcCost,GenHlth,MentHlth,PhysHlth,DiffWalk,Sex,Age,Education,Income
0,0.0,1.0,1.0,1.0,40.0,1.0,0.0,0.0,0.0,0.0,...,1.0,0.0,5.0,18.0,15.0,1.0,0.0,9.0,4.0,3.0
1,0.0,0.0,0.0,0.0,25.0,1.0,0.0,0.0,1.0,0.0,...,0.0,1.0,3.0,0.0,0.0,0.0,0.0,7.0,6.0,1.0
2,0.0,1.0,1.0,1.0,28.0,0.0,0.0,0.0,0.0,1.0,...,1.0,1.0,5.0,30.0,30.0,1.0,0.0,9.0,4.0,8.0
3,0.0,1.0,0.0,1.0,27.0,0.0,0.0,0.0,1.0,1.0,...,1.0,0.0,2.0,0.0,0.0,0.0,0.0,11.0,3.0,6.0
4,0.0,1.0,1.0,1.0,24.0,0.0,0.0,0.0,1.0,1.0,...,1.0,0.0,2.0,3.0,0.0,0.0,0.0,11.0,5.0,4.0


#### 2. Performing Descriptive Statistics

In [5]:
#calculating basic descriptive statistics
print("Mean:\n", df.mean())
print("\nMedian:\n", df.median())
print("\nMode:\n", df.mode().iloc[0])
print("\nStandard Deviation:\n", df.std())
print("\nVariance:\n", df.var())

Mean:
 Diabetes_012             0.296921
HighBP                   0.429001
HighChol                 0.424121
CholCheck                0.962670
BMI                     28.382364
Smoker                   0.443169
Stroke                   0.040571
HeartDiseaseorAttack     0.094186
PhysActivity             0.756544
Fruits                   0.634256
Veggies                  0.811420
HvyAlcoholConsump        0.056197
AnyHealthcare            0.951053
NoDocbcCost              0.084177
GenHlth                  2.511392
MentHlth                 3.184772
PhysHlth                 4.242081
DiffWalk                 0.168224
Sex                      0.440342
Age                      8.032119
Education                5.050434
Income                   6.053875
dtype: float64

Median:
 Diabetes_012             0.0
HighBP                   0.0
HighChol                 0.0
CholCheck                1.0
BMI                     27.0
Smoker                   0.0
Stroke                   0.0
HeartDiseaseorAtt

In [6]:
# Additional descriptive statistics
print("\nRange:\n", df.max() - df.min())
print("\nSkewness:\n", df.skew())
print("\nKurtosis:\n", df.kurt())


Range:
 Diabetes_012             2.0
HighBP                   1.0
HighChol                 1.0
CholCheck                1.0
BMI                     86.0
Smoker                   1.0
Stroke                   1.0
HeartDiseaseorAttack     1.0
PhysActivity             1.0
Fruits                   1.0
Veggies                  1.0
HvyAlcoholConsump        1.0
AnyHealthcare            1.0
NoDocbcCost              1.0
GenHlth                  4.0
MentHlth                30.0
PhysHlth                30.0
DiffWalk                 1.0
Sex                      1.0
Age                     12.0
Education                5.0
Income                   7.0
dtype: float64

Skewness:
 Diabetes_012            1.976390
HighBP                  0.286904
HighChol                0.307075
CholCheck              -4.881271
BMI                     2.122004
Smoker                  0.228810
Stroke                  4.657340
HeartDiseaseorAttack    2.778742
PhysActivity           -1.195546
Fruits                 -0.557

In [7]:
df.columns

Index(['Diabetes_012', 'HighBP', 'HighChol', 'CholCheck', 'BMI', 'Smoker',
       'Stroke', 'HeartDiseaseorAttack', 'PhysActivity', 'Fruits', 'Veggies',
       'HvyAlcoholConsump', 'AnyHealthcare', 'NoDocbcCost', 'GenHlth',
       'MentHlth', 'PhysHlth', 'DiffWalk', 'Sex', 'Age', 'Education',
       'Income'],
      dtype='object')

In [8]:
df.describe()

Unnamed: 0,Diabetes_012,HighBP,HighChol,CholCheck,BMI,Smoker,Stroke,HeartDiseaseorAttack,PhysActivity,Fruits,...,AnyHealthcare,NoDocbcCost,GenHlth,MentHlth,PhysHlth,DiffWalk,Sex,Age,Education,Income
count,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,...,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0
mean,0.296921,0.429001,0.424121,0.96267,28.382364,0.443169,0.040571,0.094186,0.756544,0.634256,...,0.951053,0.084177,2.511392,3.184772,4.242081,0.168224,0.440342,8.032119,5.050434,6.053875
std,0.69816,0.494934,0.49421,0.189571,6.608694,0.496761,0.197294,0.292087,0.429169,0.481639,...,0.215759,0.277654,1.068477,7.412847,8.717951,0.374066,0.496429,3.05422,0.985774,2.071148
min,0.0,0.0,0.0,0.0,12.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0
25%,0.0,0.0,0.0,1.0,24.0,0.0,0.0,0.0,1.0,0.0,...,1.0,0.0,2.0,0.0,0.0,0.0,0.0,6.0,4.0,5.0
50%,0.0,0.0,0.0,1.0,27.0,0.0,0.0,0.0,1.0,1.0,...,1.0,0.0,2.0,0.0,0.0,0.0,0.0,8.0,5.0,7.0
75%,0.0,1.0,1.0,1.0,31.0,1.0,0.0,0.0,1.0,1.0,...,1.0,0.0,3.0,2.0,3.0,0.0,1.0,10.0,6.0,8.0
max,2.0,1.0,1.0,1.0,98.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,5.0,30.0,30.0,1.0,1.0,13.0,6.0,8.0


#### 3. Performing Inferential Statistics

In [9]:
from scipy import stats

In [10]:
# Calculate the population mean of BMI (treating the full dataset as the population)
bmi_values = df['BMI']
population_mean = bmi_values.mean()
population_mean

28.382363607694735

In [11]:
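# Draw a random sample with NumPy
# (np.random.choice samples with replacement by default; no seed is set,
# so the sample and the results below change between runs)
import numpy as np
sample_size = 200
bmi_sample = np.random.choice(bmi_values, sample_size)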

In [12]:
bmi_sample

array([24., 41., 30., 34., 28., 23., 27., 26., 37., 24., 46., 49., 32.,
       31., 27., 29., 27., 38., 30., 29., 31., 25., 28., 32., 30., 29.,
       32., 28., 28., 23., 34., 24., 34., 32., 37., 32., 28., 25., 34.,
       37., 27., 29., 35., 34., 21., 39., 22., 28., 25., 27., 33., 30.,
       26., 22., 29., 26., 29., 38., 27., 32., 20., 36., 42., 26., 31.,
       30., 24., 81., 36., 25., 29., 34., 34., 23., 30., 34., 25., 27.,
       27., 34., 27., 30., 24., 22., 25., 23., 32., 27., 29., 27., 26.,
       24., 28., 30., 34., 27., 31., 23., 33., 27., 48., 27., 23., 33.,
       30., 25., 25., 28., 35., 29., 33., 35., 23., 34., 36., 26., 24.,
       28., 21., 30., 25., 29., 24., 24., 35., 37., 22., 22., 34., 30.,
       35., 29., 23., 21., 29., 23., 26., 38., 22., 27., 24., 24., 30.,
       33., 23., 35., 27., 22., 24., 30., 32., 27., 38., 22., 30., 21.,
       22., 40., 37., 18., 22., 47., 25., 31., 27., 25., 22., 33., 30.,
       26., 23., 22., 26., 33., 27., 30., 32., 29., 25., 30., 31

In [13]:
# Mean of the sample (stored separately so population_mean is not overwritten)
sample_mean = bmi_sample.mean()
sample_mean

29.325

In [14]:
# One-sample t-test: H0 is that the sample comes from a population with mean BMI 28
t_stat, p_value = stats.ttest_1samp(bmi_sample, 28)

In [15]:
p_value

0.007405510765241553

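Since the p-value (about 0.0074) is less than 0.05, we reject the null hypothesis that the sample comes from a population with mean BMI 28. The sample mean (29.325) differs significantly from 28 at the 5% level, even though the mean of the full dataset is about 28.38. Because the sample is drawn at random without a fixed seed, a different draw could lead to a different conclusion.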
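To make the decision rule for a one-sample t-test explicit, the sketch below computes the t statistic and two-sided p-value by hand and compares them with `stats.ttest_1samp`. The sample is synthetic (its mean, spread, and seed are assumptions chosen for illustration), and `alpha = 0.05` is the conventional significance level rather than anything dictated by the data.

```python
import numpy as np
from scipy import stats

# Synthetic sample, for illustration only
rng = np.random.default_rng(42)
sample = rng.normal(loc=29.0, scale=6.5, size=200)
hypothesized_mean = 28.0  # value of the mean under the null hypothesis

# t = (sample mean - hypothesized mean) / (s / sqrt(n))
n = sample.size
t_manual = (sample.mean() - hypothesized_mean) / (sample.std(ddof=1) / np.sqrt(n))
# Two-sided p-value from the t distribution with n - 1 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

# The same test via SciPy
t_scipy, p_scipy = stats.ttest_1samp(sample, hypothesized_mean)

# Reject H0 only when the p-value is below the significance level
alpha = 0.05
decision = "reject H0" if p_scipy < alpha else "fail to reject H0"
print(f"t = {t_scipy:.3f}, p = {p_scipy:.4f}: {decision}")
```

A p-value below `alpha` means the observed sample mean would be unlikely if the null hypothesis were true, so the null hypothesis is rejected; a p-value above `alpha` means the data do not provide enough evidence to reject it.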

## Regression Analysis

In [16]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix

In [17]:
# Split the data into features (X) and the target variable (y)
X = df.drop(columns=['Diabetes_012'])
y = df['Diabetes_012']

In [19]:
X

Unnamed: 0,HighBP,HighChol,CholCheck,BMI,Smoker,Stroke,HeartDiseaseorAttack,PhysActivity,Fruits,Veggies,...,AnyHealthcare,NoDocbcCost,GenHlth,MentHlth,PhysHlth,DiffWalk,Sex,Age,Education,Income
0,1.0,1.0,1.0,40.0,1.0,0.0,0.0,0.0,0.0,1.0,...,1.0,0.0,5.0,18.0,15.0,1.0,0.0,9.0,4.0,3.0
1,0.0,0.0,0.0,25.0,1.0,0.0,0.0,1.0,0.0,0.0,...,0.0,1.0,3.0,0.0,0.0,0.0,0.0,7.0,6.0,1.0
2,1.0,1.0,1.0,28.0,0.0,0.0,0.0,0.0,1.0,0.0,...,1.0,1.0,5.0,30.0,30.0,1.0,0.0,9.0,4.0,8.0
3,1.0,0.0,1.0,27.0,0.0,0.0,0.0,1.0,1.0,1.0,...,1.0,0.0,2.0,0.0,0.0,0.0,0.0,11.0,3.0,6.0
4,1.0,1.0,1.0,24.0,0.0,0.0,0.0,1.0,1.0,1.0,...,1.0,0.0,2.0,3.0,0.0,0.0,0.0,11.0,5.0,4.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
253675,1.0,1.0,1.0,45.0,0.0,0.0,0.0,0.0,1.0,1.0,...,1.0,0.0,3.0,0.0,5.0,0.0,1.0,5.0,6.0,7.0
253676,1.0,1.0,1.0,18.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,4.0,0.0,0.0,1.0,0.0,11.0,2.0,4.0
253677,0.0,0.0,1.0,28.0,0.0,0.0,0.0,1.0,1.0,0.0,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,2.0,5.0,2.0
253678,1.0,0.0,1.0,23.0,0.0,0.0,0.0,0.0,1.0,1.0,...,1.0,0.0,3.0,0.0,0.0,0.0,1.0,7.0,5.0,1.0


In [20]:
y

0         0.0
1         0.0
2         0.0
3         0.0
4         0.0
         ... 
253675    0.0
253676    2.0
253677    0.0
253678    0.0
253679    2.0
Name: Diabetes_012, Length: 253680, dtype: float64

In [21]:
#Train-test split (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [22]:
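# Standardize the features: fit the scaler on the training set only,
# then apply the same transformation to the test set
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)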

In [28]:
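# One-vs-rest logistic regression for the three-class target.
# Note: the multi_class parameter is deprecated in recent scikit-learn releases,
# so this call may emit a warning or need adjusting on newer versions.
logreg = LogisticRegression(max_iter=1000, multi_class='ovr')
logreg.fit(X_train, y_train)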



In [29]:
y_pred = logreg.predict(X_test)

In [32]:
print(y_pred)

[0. 0. 0. ... 0. 0. 0.]


In [33]:
print("Accuracy Score:", accuracy_score(y_test, y_pred))

Accuracy Score: 0.8483325449385052
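Accuracy alone can hide how a classifier performs on each class, especially when one class dominates (the descriptive statistics above show that at least 75% of rows have `Diabetes_012` equal to 0). The `confusion_matrix` and `classification_report` functions imported earlier give a per-class view. The sketch below uses synthetic, imbalanced three-class data from `make_classification` rather than the BRFSS dataset, so its numbers say nothing about the model above; all names ending in `_demo`, `_tr`, and `_te` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced 3-class data (illustration only, not the BRFSS dataset)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, n_informative=5, n_classes=3,
    weights=[0.85, 0.05, 0.10], random_state=42,
)
# stratify keeps the class proportions the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
pred = model.predict(X_te)

accuracy = accuracy_score(y_te, pred)
print("Accuracy:", accuracy)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_te, pred)
print(cm)

# Per-class precision, recall and F1; zero_division=0 avoids warnings
# when a class is never predicted
print(classification_report(y_te, pred, zero_division=0))
```

On the notebook's own data, the equivalent calls would be `confusion_matrix(y_test, y_pred)` and `classification_report(y_test, y_pred)`.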
