# STATISTICAL ANALYSIS IN PYTHON WITH REAL-WORLD DATASETS

Statistical analysis is central to data science and machine learning: it helps you extract insights, make predictions, and support decisions. In this notebook, we'll:

- Use Python Libraries : Explore statistical methods with pandas, numpy, and scipy.
- Work with Datasets : Analyze the built-in Diabetes dataset from scikit-learn, and datasets from Kaggle and the UCI Machine Learning Repository.

This notebook shows how to perform and interpret basic statistical analyses so that you can draw meaningful conclusions from data.

SOME FUNDAMENTAL CONCEPTS :
 -

POPULATION : The entire set of individuals or items that you're interested in studying. It includes all possible observations or measurements for a particular characteristic. For example, if you're studying the heights of all adults in a country, the population would be all adults in that country.

SAMPLE : A subset of the population selected for analysis. It's used when it's impractical or impossible to collect data from the entire population. The sample should ideally represent the population well, allowing you to make inferences or draw conclusions about the population based on the sample. For instance, if you take a survey of 1,000 adults from different regions of the country, this group is your sample.

- NEED FOR SAMPLING : 

 1. COST-EFFECTIVE : Reduces the expense of data collection.
 2. TIME-SAVING : Allows for quicker data analysis compared to studying an entire population.
 3. PRACTICALITY : Makes data collection feasible when it's impractical to survey the whole population.

- BENEFITS OF SAMPLING :
  
 1. RESOURCE EFFICIENCY : Saves money and time while still providing valuable insights.
 2. MANAGEABLE DATA : Eases the complexity of data handling and analysis.
 3. TIMELY INSIGHTS : Facilitates quicker decision-making based on the sample's findings.
 4. ACCURATE ESTIMATES : When done correctly, sampling can provide accurate and reliable estimates about the population.

In essence, sampling helps to efficiently gather and analyze data, making it a practical choice for deriving insights and making informed decisions.
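
As a minimal sketch of the population/sample idea, the snippet below builds a synthetic population of heights, draws a simple random sample of 1,000, and compares the two means. The population values, the seed, and the sizes are illustrative assumptions, not data used elsewhere in this notebook.

```python
import numpy as np

# Hypothetical population: 100,000 adult heights in cm (synthetic, for illustration only)
rng = np.random.default_rng(seed=0)
population = rng.normal(loc=170.0, scale=10.0, size=100_000)

# Simple random sample of 1,000 individuals, drawn without replacement
sample = rng.choice(population, size=1_000, replace=False)

population_mean = population.mean()
sample_mean = sample.mean()

print(f"Population mean: {population_mean:.2f}")
print(f"Sample mean:     {sample_mean:.2f}")
print(f"Difference:      {abs(population_mean - sample_mean):.2f}")
```

With a sample this size, the two means should usually land close together, which is the practical argument for sampling made above.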

BASIC STATISTICAL METHODS :
 -

- DESCRIPTIVE STATISTICS :

Descriptive statistics are techniques used to summarize and describe the main features of a dataset. They provide a snapshot of the data, making it easier to understand and interpret. Here are some key components of descriptive statistics:

MEASURES OF CENTRAL TENDENCY :

1. Mean : The average value of the dataset.
2. Median : The middle value when the data is sorted in ascending order.
3. Mode : The most frequently occurring value in the dataset.
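
A small worked example may make the three measures concrete. The values below are made up for illustration; the calls are the same pandas methods used on the dataset later in this notebook.

```python
import pandas as pd

# Small illustrative series (made-up values, not from the dataset)
values = pd.Series([2, 3, 3, 5, 7, 10])

mean_value = values.mean()          # (2 + 3 + 3 + 5 + 7 + 10) / 6 = 5.0
median_value = values.median()      # even count: average of the two middle values, (3 + 5) / 2 = 4.0
mode_value = values.mode().iloc[0]  # 3 occurs twice, more often than any other value

print(mean_value, median_value, mode_value)
```

Note that `mode()` returns a Series because a dataset can have several modes; `.iloc[0]` picks the smallest one.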
 
MEASURES OF DISPERSION :

4. Range : The difference between the maximum and minimum values.
5. Variance : The average squared deviation from the mean.
6. Standard Deviation : The square root of the variance, indicating the spread of data around the mean.
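
The sketch below works through these three measures on made-up values. One detail worth knowing: dividing by n gives the population variance, while pandas' `var()` and `std()` divide by n - 1 by default (the sample versions), so the numbers reported later in this notebook are sample statistics.

```python
import numpy as np

# Made-up values with mean 5.0
data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

data_range = data.max() - data.min()  # 9 - 2 = 7

# Population variance: sum of squared deviations (32) divided by n (8)
population_variance = data.var(ddof=0)
# Sample variance: same sum divided by n - 1 (7)
sample_variance = data.var(ddof=1)

# Standard deviation is the square root of the variance
population_std = data.std(ddof=0)

print(data_range, population_variance, sample_variance, population_std)
```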

MEASURES OF SHAPE :

7. Skewness : Indicates the asymmetry of the data distribution.
8. Kurtosis : Measures the "tailedness" of the data distribution, i.e. how heavy its tails are compared with a normal distribution.

PERCENTILES AND QUARTILES :

9. Percentiles : Values below which a certain percentage of the data falls.
10. Quartiles : The three values that divide the sorted data into four equal parts: the first quartile (25th percentile), the median (50th percentile), and the third quartile (75th percentile).

FREQUENCY DISTRIBUTION :

11. Frequency Tables : Show how often each value occurs in the dataset.
12. Histograms : Visual representations of frequency distributions.
    
Descriptive statistics help summarize and present the main characteristics of a dataset in a meaningful way, facilitating better understanding and initial insights.
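
Frequency tables and histogram counts are not computed later in this notebook, so here is a minimal sketch of both. The values and variable names are made up for illustration; `value_counts` and `np.histogram` are the standard pandas and NumPy calls.

```python
import numpy as np
import pandas as pd

# Frequency table for a discrete variable (made-up category codes)
codes = pd.Series([0, 2, 1, 0, 0, 3, 2, 0])
frequency_table = codes.value_counts().sort_index()
print(frequency_table)

# Histogram counts for a continuous variable (made-up ages), 3 equal-width bins
ages = np.array([29, 35, 41, 47, 52, 58, 63, 77])
counts, bin_edges = np.histogram(ages, bins=3)
print(counts)     # number of observations per bin
print(bin_edges)  # bin boundaries: 29, 45, 61, 77
```

To draw the histogram rather than just count, one option is `pd.Series(ages).plot.hist(bins=3)`, assuming matplotlib is installed.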





     

- INFERENTIAL STATISTICS :

Inferential statistics involves using data from a sample to make generalizations or predictions about a population. It extends beyond descriptive statistics by making inferences or drawing conclusions based on statistical analyses. Here are key components of inferential statistics:

HYPOTHESIS TESTING :

1. Null Hypothesis (H₀) : A statement that there is no effect or no difference, used as a baseline for comparison.
2. Alternative Hypothesis (H₁) : A statement that there is an effect or a difference.
3. P-Value : The probability of observing data at least as extreme as the sample, assuming the null hypothesis is true. A low p-value (below a chosen significance level such as 0.05) indicates strong evidence against the null hypothesis.
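
To show how these three pieces fit together, the sketch below computes a one-sample t-test by hand and compares it with scipy's result. The sample values and the hypothesized mean of 50 are made up for illustration.

```python
import numpy as np
from scipy import stats

# Made-up sample; H0: the population mean equals 50, H1: it does not
sample = np.array([52.1, 48.3, 55.0, 51.2, 49.8, 53.6, 50.9, 54.4])
mu_0 = 50.0

# Test statistic: how many standard errors the sample mean lies from mu_0
n = sample.size
standard_error = sample.std(ddof=1) / np.sqrt(n)
t_manual = (sample.mean() - mu_0) / standard_error

# Two-sided p-value: probability, under H0, of a |t| at least this large
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

# scipy's built-in test should agree with the manual computation
t_scipy, p_scipy = stats.ttest_1samp(sample, mu_0)

print(f"manual: t = {t_manual:.4f}, p = {p_manual:.4f}")
print(f"scipy:  t = {t_scipy:.4f}, p = {p_scipy:.4f}")
```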

CONFIDENCE INTERVALS :

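A range of values, computed from a sample, that is expected to contain the population parameter with a specified level of confidence. For example, a 95% confidence interval comes from a procedure that captures the true parameter in about 95% of repeated samples.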

ESTIMATION :

1. Point Estimate : A single value estimate of a population parameter (e.g., sample mean as an estimate of the population mean).
2. Interval Estimate : A range of values within which the population parameter is expected to lie.

REGRESSION ANALYSIS :

1. Simple Linear Regression : Models the relationship between a dependent variable and a single independent variable with a straight line; used for prediction and for understanding relationships.
2. Multiple Regression : Extends linear regression to two or more independent variables.
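
As a rough sketch of what fitting a linear regression means, the snippet below recovers a known line with ordinary least squares in NumPy. The data are made up so that the true intercept and slope are 1 and 2; this is an illustration, not the statsmodels workflow used later in the notebook.

```python
import numpy as np

# Made-up data that follows y = 2x + 1 exactly
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

# Design matrix: a column of ones (intercept) next to the predictor
X = np.column_stack([np.ones_like(x), x])

# Ordinary least squares: coefficients that minimize the squared residuals
coefficients, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, slope = coefficients

print(f"intercept = {intercept:.3f}, slope = {slope:.3f}")
```

Adding more predictor columns to `X` turns the same call into a multiple regression.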

ANOVA (Analysis of Variance) :

Tests if there are significant differences between the means of three or more groups.
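
ANOVA is not run on the dataset in this notebook, so here is a minimal sketch with `scipy.stats.f_oneway`. The three groups are made up; the third is deliberately shifted so that the test has something to detect.

```python
from scipy import stats

# Three made-up groups; group_c has a clearly higher mean
group_a = [5.1, 4.9, 5.3, 5.0, 5.2]
group_b = [5.0, 5.2, 4.8, 5.1, 4.9]
group_c = [6.9, 7.1, 7.0, 6.8, 7.2]

# One-way ANOVA: H0 says all group means are equal
f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)

print(f"F = {f_stat:.2f}, p = {p_value:.3g}")
if p_value < 0.05:
    print("At least one group mean differs from the others.")
```

A significant result says only that some difference exists; it does not say which groups differ.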

CORRELATION :

Measures the strength and direction of the relationship between two variables.
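
The most common such measure is the Pearson correlation coefficient, which is the covariance of the two variables divided by the product of their standard deviations. The sketch below computes it by hand on made-up values and checks it against scipy.

```python
import numpy as np
from scipy import stats

# Made-up paired observations
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Pearson r = cov(x, y) / (std(x) * std(y)), using sample (n - 1) versions
covariance = np.cov(x, y, ddof=1)[0, 1]
r_manual = covariance / (x.std(ddof=1) * y.std(ddof=1))

# scipy returns the same coefficient plus a p-value for H0: r = 0
r_scipy, p_value = stats.pearsonr(x, y)

print(f"manual r = {r_manual:.4f}, scipy r = {r_scipy:.4f}")
```

The coefficient ranges from -1 (perfect negative linear relationship) to +1 (perfect positive), with 0 meaning no linear relationship.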

SAMPLING DISTRIBUTIONS :

Describes how a sample statistic (e.g., sample mean) would vary from sample to sample if multiple samples were taken from the same population.

Inferential statistics allows you to make predictions, test hypotheses, and draw conclusions about a population based on sample data, providing a way to understand broader trends and relationships.
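
A sampling distribution is easiest to see by simulation. The sketch below draws many samples from a made-up, skewed population and compares the spread of the sample means with the theoretical standard error, sigma / sqrt(n). The population, seed, and sizes are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Made-up, right-skewed population
population = rng.exponential(scale=2.0, size=200_000)

sample_size = 50
n_samples = 5_000

# Draw many samples (one per row) and record each sample's mean
samples = rng.choice(population, size=(n_samples, sample_size))
sample_means = samples.mean(axis=1)

# Theory: the standard deviation of the sample mean is sigma / sqrt(n)
theoretical_se = population.std() / np.sqrt(sample_size)
empirical_se = sample_means.std()

print(f"Theoretical standard error: {theoretical_se:.4f}")
print(f"Empirical standard error:   {empirical_se:.4f}")
```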

SAMPLE STATISTICAL ANALYSIS :
-

- IMPORTING REQUIRED PACKAGES AND MODULES :

In [23]:
pip install scipy



In [25]:
pip install statsmodels


In [26]:
import pandas as pd
import numpy as np
from scipy import stats
import statsmodels.api as sm

- READ AND CONVERT HEART DISEASE DATASET INTO DATAFRAME :

In [2]:
data=pd.read_csv("heart_disease.csv")

In [3]:
# read_csv already returns a DataFrame, so this conversion is redundant but harmless
df = pd.DataFrame(data)
df

      age  sex   cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  slope   ca  thal  target
0      52    1    0       125   212    0        1      168      0      1.0      2    2     3       0
1      53    1    0       140   203    1        0      155      1      3.1      0    0     3       0
2      70    1    0       145   174    0        1      125      1      2.6      0    0     3       0
3      61    1    0       148   203    0        1      161      0      0.0      2    1     3       0
4      62    0    0       138   294    1        1      106      0      1.9      1    3     2       0
...   ...  ...  ...       ...   ...  ...      ...      ...    ...      ...    ...  ...   ...     ...
1020   59    1    1       140   221    0        1      164      1      0.0      2    0     2       1
1021   60    1    0       125   258    0        0      141      1      2.8      1    1     3       0
1022   47    1    0       110   275    0        0      118      1      1.0      1    1     2       0
1023   50    0    0       110   254    0        0      159      0      0.0      2    0     2       1


- FEATURES :
  ['age', 'sex', 'chest_pain_type' ("cp"), 'resting_blood_pressure' ("trestbps"), 'cholesterol' ("chol"), 'fasting_blood_sugar' ("fbs"), 'rest_ecg' ("restecg"), 'max_heart_rate_achieved' ("thalach"), 'exercise_induced_angina' ("exang"), 'st_depression' ("oldpeak"), 'st_slope' ("slope"), 'num_major_vessels' ("ca"), 'thalassemia' ("thal"), 'condition' ("target")]


- PERFORMING DESCRIPTIVE STATISTICS :

In [4]:
# 1. MEAN
print("Mean:\n", df.mean())

Mean:
 age          54.434146
sex           0.695610
cp            0.942439
trestbps    131.611707
chol        246.000000
fbs           0.149268
restecg       0.529756
thalach     149.114146
exang         0.336585
oldpeak       1.071512
slope         1.385366
ca            0.754146
thal          2.323902
target        0.513171
dtype: float64


In [5]:
# 2. MEDIAN
print("\nMedian:\n", df.median())


Median:
 age          56.0
sex           1.0
cp            1.0
trestbps    130.0
chol        240.0
fbs           0.0
restecg       1.0
thalach     152.0
exang         0.0
oldpeak       0.8
slope         1.0
ca            0.0
thal          2.0
target        1.0
dtype: float64


In [6]:
# 3.MODE
print("\nMode:\n", df.mode().iloc[0])


Mode:
 age          58.0
sex           1.0
cp            0.0
trestbps    120.0
chol        204.0
fbs           0.0
restecg       1.0
thalach     162.0
exang         0.0
oldpeak       0.0
slope         1.0
ca            0.0
thal          2.0
target        1.0
Name: 0, dtype: float64


In [7]:
# 4.RANGE
print("\nRange:\n", df.max() - df.min())


Range:
 age          48.0
sex           1.0
cp            3.0
trestbps    106.0
chol        438.0
fbs           1.0
restecg       2.0
thalach     131.0
exang         1.0
oldpeak       6.2
slope         2.0
ca            4.0
thal          3.0
target        1.0
dtype: float64


In [8]:
# 5.VARIANCE
print("\nVariance:\n", df.var())


Variance:
 age           82.306450
sex            0.211944
cp             1.060160
trestbps     306.835410
chol        2661.787109
fbs            0.127111
restecg        0.278655
thalach      529.263325
exang          0.223514
oldpeak        1.380750
slope          0.381622
ca             1.062544
thal           0.385219
target         0.250071
dtype: float64


In [9]:
# 6. STANDARD DEVIATION
print("\nStandard Deviation:\n", df.std())


Standard Deviation:
 age          9.072290
sex          0.460373
cp           1.029641
trestbps    17.516718
chol        51.592510
fbs          0.356527
restecg      0.527878
thalach     23.005724
exang        0.472772
oldpeak      1.175053
slope        0.617755
ca           1.030798
thal         0.620660
target       0.500070
dtype: float64


In [10]:
# 7. SKEWNESS
print("\nSkewness:\n", df.skew())


Skewness:
 age        -0.248866
sex        -0.851449
cp          0.529455
trestbps    0.739768
chol        1.074073
fbs         1.971339
restecg     0.180440
thalach    -0.513777
exang       0.692655
oldpeak     1.210899
slope      -0.479134
ca          1.261189
thal       -0.524390
target     -0.052778
dtype: float64


In [11]:
# 8.KURTOSIS
print("\nKurtosis:\n", df.kurt())


Kurtosis:
 age        -0.525618
sex        -1.277531
cp         -1.149500
trestbps    0.991221
chol        3.996803
fbs         1.889859
restecg    -1.309614
thalach    -0.088822
exang      -1.523205
oldpeak     1.314471
slope      -0.647129
ca          0.701123
thal        0.250827
target     -2.001123
dtype: float64


In [12]:
# 9. PERCENTILES
numeric_columns = ['age', 'chol']  
percentiles = [25, 50, 75]

for column in numeric_columns:
    print(f"\nPercentiles for '{column}':")
    for percentile in percentiles:
        value = np.percentile(df[column].dropna(), percentile)
        print(f"{percentile}th percentile: {value}")


Percentiles for 'age':
25th percentile: 48.0
50th percentile: 56.0
75th percentile: 61.0

Percentiles for 'chol':
25th percentile: 211.0
50th percentile: 240.0
75th percentile: 275.0
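
By default, both `np.percentile` and the pandas `quantile` method interpolate linearly when a percentile falls between two sorted values. The made-up example below is small enough to check by hand.

```python
import numpy as np
import pandas as pd

# Four made-up values, so every quartile falls between two data points
values = pd.Series([1.0, 2.0, 3.0, 4.0])

# Position of the p-th percentile in the sorted data is (n - 1) * p / 100.
# For p = 25: 3 * 0.25 = 0.75, i.e. three quarters of the way from 1.0 to 2.0.
p25, p50, p75 = np.percentile(values, [25, 50, 75])

# pandas uses the same linear interpolation by default
pandas_quartiles = values.quantile([0.25, 0.5, 0.75])

print(p25, p50, p75)
print(pandas_quartiles)
```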


In [13]:
# 10. QUARTILES 
quartiles = df.quantile([0.25, 0.5, 0.75])
print("Quartiles for all numerical columns:")
print(quartiles)

Quartiles for all numerical columns:
       age  sex   cp  trestbps   chol  fbs  restecg  thalach  exang  oldpeak  \
0.25  48.0  0.0  0.0     120.0  211.0  0.0      0.0    132.0    0.0      0.0   
0.50  56.0  1.0  1.0     130.0  240.0  0.0      1.0    152.0    0.0      0.8   
0.75  61.0  1.0  2.0     140.0  275.0  0.0      1.0    166.0    1.0      1.8   

      slope   ca  thal  target  
0.25    1.0  0.0   2.0     0.0  
0.50    1.0  0.0   2.0     1.0  
0.75    2.0  1.0   3.0     1.0  


- PERFORMING INFERENTIAL STATISTICS :

HYPOTHESIS TESTING :

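Null hypothesis (H₀) : The mean cholesterol is 200 mg/dL.

Alternative hypothesis (H₁) : The mean cholesterol is not 200 mg/dL (a two-sided test, which is the default for scipy's ttest_1samp).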


In [15]:
t_stat, p_value = stats.ttest_1samp(df['chol'], 200)

# Print the t-statistic and p-value
print(f"T-statistic: {t_stat}")
print(f"P-value: {p_value}")

# Decision based on p-value
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis (H₀): There is a significant difference in mean cholesterol from 200 mg/dL.")
else:
    print("Fail to reject the null hypothesis (H₀): There is no significant difference in mean cholesterol from 200 mg/dL.")

T-statistic: 28.54520101317123
P-value: 2.5197710109776723e-132
Reject the null hypothesis (H₀): There is a significant difference in mean cholesterol from 200 mg/dL.
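
The t-statistic can be checked by hand from the summary numbers already printed in this notebook: the mean and standard deviation of `chol` from the descriptive statistics, and n = 1025 from the regression summary further down. This is only a sanity-check sketch; the inputs are rounded, so the result matches to about three decimals.

```python
import numpy as np

# Summary values copied from the notebook outputs (rounded as printed)
sample_mean = 246.0     # mean of 'chol'
sample_std = 51.592510  # sample standard deviation of 'chol'
n = 1025                # number of observations
mu_0 = 200.0            # hypothesized mean under H0

# t = (sample mean - hypothesized mean) / standard error
standard_error = sample_std / np.sqrt(n)
t_stat_check = (sample_mean - mu_0) / standard_error

print(f"Standard error: {standard_error:.4f}")
print(f"t-statistic:    {t_stat_check:.3f}")
```

The sample mean sits roughly 28.5 standard errors above 200, which is why the p-value is so small.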


CONFIDENCE INTERVAL :

In [17]:
import pandas as pd
import numpy as np
from scipy import stats

# Load the dataset
df = pd.read_csv('heart_disease.csv')

# The 'chol' column contains cholesterol levels (mg/dL)
cholesterol_data = df['chol']
# Sample mean
mean_cholesterol = np.mean(cholesterol_data)
# Sample standard deviation
std_cholesterol = np.std(cholesterol_data, ddof=1)  # ddof=1 to get sample standard deviation
# Sample size
n = len(cholesterol_data)

# Confidence level (for 95% CI, the Z-score is approximately 1.96)
confidence_level = 0.95
z_score = stats.norm.ppf(1 - (1 - confidence_level) / 2)

# Calculate the margin of error
margin_of_error = z_score * (std_cholesterol / np.sqrt(n))

# Confidence interval
lower_bound = mean_cholesterol - margin_of_error
upper_bound = mean_cholesterol + margin_of_error

# Print results
print(f"Mean Cholesterol: {mean_cholesterol:.2f}")
print(f"95% Confidence Interval: ({lower_bound:.2f}, {upper_bound:.2f})")


Mean Cholesterol: 246.00
95% Confidence Interval: (242.84, 249.16)
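
The interval above uses a normal (z) critical value. When the population standard deviation is estimated from the sample, a t-based interval is the more usual choice; with n = 1025 the two are almost identical. The sketch below assumes the rounded summary values printed earlier and uses `stats.t.interval`.

```python
import numpy as np
from scipy import stats

# Summary values copied from the notebook outputs (rounded as printed)
sample_mean = 246.0
sample_std = 51.592510
n = 1025

standard_error = sample_std / np.sqrt(n)

# t-based 95% interval with n - 1 degrees of freedom
t_lower, t_upper = stats.t.interval(0.95, n - 1, loc=sample_mean, scale=standard_error)

# z-based 95% interval, as computed in the cell above
z_critical = stats.norm.ppf(0.975)
z_lower = sample_mean - z_critical * standard_error
z_upper = sample_mean + z_critical * standard_error

print(f"t-based: ({t_lower:.2f}, {t_upper:.2f})")
print(f"z-based: ({z_lower:.2f}, {z_upper:.2f})")
```

For small samples the t-based interval is noticeably wider, so it is the safer default.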


CORRELATION :

In [19]:
# Pearson correlation between 'age' and 'chol' (cholesterol)
correlation = df['age'].corr(df['chol'])
print(f"Correlation between Age and Cholesterol: {correlation}")

# Correlation matrix for all numerical variables
print(df.corr())

Correlation between Age and Cholesterol: 0.21982253466576027
               age       sex        cp  trestbps      chol       fbs  \
age       1.000000 -0.103240 -0.071966  0.271121  0.219823  0.121243   
sex      -0.103240  1.000000 -0.041119 -0.078974 -0.198258  0.027200   
cp       -0.071966 -0.041119  1.000000  0.038177 -0.081641  0.079294   
trestbps  0.271121 -0.078974  0.038177  1.000000  0.127977  0.181767   
chol      0.219823 -0.198258 -0.081641  0.127977  1.000000  0.026917   
fbs       0.121243  0.027200  0.079294  0.181767  0.026917  1.000000   
restecg  -0.132696 -0.055117  0.043581 -0.123794 -0.147410 -0.104051   
thalach  -0.390227 -0.049365  0.306839 -0.039264 -0.021772 -0.008866   
exang     0.088163  0.139157 -0.401513  0.061197  0.067382  0.049261   
oldpeak   0.208137  0.084687 -0.174733  0.187434  0.064880  0.010859   
slope    -0.169105 -0.026666  0.131633 -0.120445 -0.014248 -0.061902   
ca        0.271551  0.111729 -0.176206  0.104554  0.074259  0.137156   
...

REGRESSION ANALYSIS :

In [27]:
X = df[['chol']]  # Predictor variable
y = df['trestbps']  # Response variable

# Add a constant to the model (intercept)
X = sm.add_constant(X)

# Fit the model
model = sm.OLS(y, X).fit()

# Print the summary of the model
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:               trestbps   R-squared:                       0.016
Model:                            OLS   Adj. R-squared:                  0.015
Method:                 Least Squares   F-statistic:                     17.03
Date:                Sun, 08 Sep 2024   Prob (F-statistic):           3.97e-05
Time:                        20:09:31   Log-Likelihood:                -4380.2
No. Observations:                1025   AIC:                             8764.
Df Residuals:                    1023   BIC:                             8774.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        120.9228      2.646     45.698      0.0
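
For a regression with a single predictor, R-squared equals the squared Pearson correlation between predictor and response, and the F-statistic can be written as r²(n - 2) / (1 - r²). The sketch below checks the summary against the `chol`/`trestbps` correlation (0.127977) printed in the correlation matrix; because that input is rounded, agreement is approximate.

```python
# Values copied from the notebook outputs
r = 0.127977  # correlation between 'chol' and 'trestbps'
n = 1025      # number of observations

# Simple linear regression: R-squared is the squared correlation
r_squared = r ** 2

# F-statistic for one predictor, with n - 2 residual degrees of freedom
f_stat = r_squared * (n - 2) / (1 - r_squared)

print(f"R-squared:   {r_squared:.3f}")
print(f"F-statistic: {f_stat:.2f}")
```

Both values agree with the summary table: cholesterol is a statistically significant but very weak predictor of resting blood pressure.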

CONCLUSION :
-

- Descriptive statistics summarize the central tendency, spread, and shape of each variable.
- Confidence intervals provide a range of plausible values for a population parameter, such as the (242.84, 249.16) interval for mean cholesterol.
- Hypothesis testing evaluates whether observed data deviate significantly from what is expected under a null hypothesis; here, mean cholesterol differed significantly from 200 mg/dL.
- Regression analysis models relationships between predictors and outcomes; in this example, cholesterol explained only about 1.6% of the variance in resting blood pressure (R-squared = 0.016).

Applied together, these methods give a structured first view of the heart disease dataset. They help identify candidate risk factors, point to areas for further research, and support data-driven decisions about health interventions.