In [39]:
pip install pandas numpy scipy scikit-learn statsmodels





### 1) Loading the Wine Dataset

- Import Libraries: Import pandas for handling data and load_wine to get the wine dataset.
- Load Dataset: wine = load_wine() loads the dataset.
- Create DataFrame: df holds the data in a table-like structure with columns named after the features. df['target'] adds a column for the target labels (i.e., the wine categories).
- Display Data: print(df.head()) shows the first few rows of the dataset so you can see what it looks like.

In [40]:

import pandas as pd
from sklearn.datasets import load_wine

# Load the dataset
wine = load_wine()
df = pd.DataFrame(data=wine.data, columns=wine.feature_names)
df['target'] = wine.target

# Display the first few rows
print(df.head())


   alcohol  malic_acid   ash  alcalinity_of_ash  magnesium  total_phenols  \
0    14.23        1.71  2.43               15.6      127.0           2.80   
1    13.20        1.78  2.14               11.2      100.0           2.65   
2    13.16        2.36  2.67               18.6      101.0           2.80   
3    14.37        1.95  2.50               16.8      113.0           3.85   
4    13.24        2.59  2.87               21.0      118.0           2.80   

   flavanoids  nonflavanoid_phenols  proanthocyanins  color_intensity   hue  \
0        3.06                  0.28             2.29             5.64  1.04   
1        2.76                  0.26             1.28             4.38  1.05   
2        3.24                  0.30             2.81             5.68  1.03   
3        3.49                  0.24             2.18             7.80  0.86   
4        2.69                  0.39             1.82             4.32  1.04   

   od280/od315_of_diluted_wines  proline  target  
0          

### 2) Performing Descriptive Statistics

- Mean: Average value of each feature.
- Median: Middle value when the data is sorted.
- Mode: Most frequent value. .iloc[0] selects the first mode in case there are multiple.
- Standard Deviation: Measures how spread out the values are from the mean.
- Variance: The square of the standard deviation, showing spread.
- Range: Difference between the maximum and minimum values.
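- Skewness: Measures the asymmetry of the data distribution. Positive values indicate a longer right tail, negative values a longer left tail.
- Kurtosis: Measures the "tailedness" of the data distribution. pandas' kurt() reports excess kurtosis, so a normal distribution scores 0.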

In [41]:

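# Calculate basic descriptive statistics
# (df still contains the 'target' label column, so its statistics appear too;
# they describe class codes, not a measured quantity)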
print("Mean:\n", df.mean())
print("\nMedian:\n", df.median())
print("\nMode:\n", df.mode().iloc[0])
print("\nStandard Deviation:\n", df.std())
print("\nVariance:\n", df.var())

# Additional descriptive statistics
print("\nRange:\n", df.max() - df.min())
print("\nSkewness:\n", df.skew())
print("\nKurtosis:\n", df.kurt())


Mean:
 alcohol                          13.000618
malic_acid                        2.336348
ash                               2.366517
alcalinity_of_ash                19.494944
magnesium                        99.741573
total_phenols                     2.295112
flavanoids                        2.029270
nonflavanoid_phenols              0.361854
proanthocyanins                   1.590899
color_intensity                   5.058090
hue                               0.957449
od280/od315_of_diluted_wines      2.611685
proline                         746.893258
target                            0.938202
dtype: float64

Median:
 alcohol                          13.050
malic_acid                        1.865
ash                               2.360
alcalinity_of_ash                19.500
magnesium                        98.000
total_phenols                     2.355
flavanoids                        2.135
nonflavanoid_phenols              0.340
proanthocyanins                   1.555
color_
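
To make the definitions in this section concrete, here is a minimal sketch that computes the same kinds of statistics with only the Python standard library. It uses the five alcohol values printed by df.head() in section 1 as a toy sample, so the numbers will not match the full-dataset output. The helper name describe is hypothetical, and the skewness and kurtosis formulas are the usual bias-adjusted sample versions; that they match what pandas' skew() and kurt() report is an assumption, not something stated in this notebook.

```python
import math
import statistics

# First five 'alcohol' values from the df.head() output (toy sample only)
alcohol_head = [14.23, 13.20, 13.16, 14.37, 13.24]


def describe(values):
    """Return basic descriptive statistics for a list of numbers (hypothetical helper)."""
    n = len(values)
    mean = statistics.fmean(values)
    # Central moments (divided by n), used by the skewness and kurtosis formulas
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    # Bias-adjusted sample skewness (needs n >= 3)
    skew = math.sqrt(n * (n - 1)) / (n - 2) * m3 / m2 ** 1.5
    # Bias-adjusted excess kurtosis (needs n >= 4); 0 for a normal distribution
    excess = m4 / m2 ** 2 - 3
    kurt = (n - 1) / ((n - 2) * (n - 3)) * ((n + 1) * excess + 6)
    return {
        "mean": mean,
        "median": statistics.median(values),
        "std": statistics.stdev(values),          # sample std (n - 1 denominator)
        "variance": statistics.variance(values),  # sample variance (n - 1 denominator)
        "range": max(values) - min(values),
        "skewness": skew,
        "kurtosis": kurt,
    }


summary = describe(alcohol_head)
for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```

With only five values the skewness and kurtosis estimates are very noisy; the point of the sketch is the arithmetic, not the result.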

### 3) Performing Inferential Statistics

- Import Libraries: Import scipy.stats for statistical tests.
- Select Feature: Choose the 'alcohol' content from the dataset.
- Hypothetical Mean: Set a value (13.0) to compare against.
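- One-Sample T-Test: Test whether the sample mean of 'alcohol' differs from 13.0. t_stat measures how far the sample mean is from the hypothetical mean in units of standard error, and p_value is the probability of seeing a deviation at least this large if the true mean really were 13.0. A large p-value (such as the 0.99 obtained here) means the data give no evidence that the mean differs from 13.0.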

In [42]:

from scipy import stats

# Select the feature of interest, e.g., 'alcohol'
alcohol_values = df['alcohol']

# Hypothetical population mean for the alcohol content
population_mean = 13.0

# Perform one-sample t-test
t_stat, p_value = stats.ttest_1samp(alcohol_values, population_mean)

print("T-Statistic:", t_stat)
print("P-Value:", p_value)


T-Statistic: 0.01015592394800969
P-Value: 0.9919083221024861
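
The t-statistic reported above is the distance between the sample mean and the hypothesised mean, measured in standard errors. A minimal sketch of that calculation follows, using only the standard library and the five alcohol values from df.head() as a toy sample (so the result will not match the full-dataset output). The function name one_sample_t is hypothetical.

```python
import math
import statistics

# First five 'alcohol' values from the df.head() output (toy sample only)
alcohol_head = [14.23, 13.20, 13.16, 14.37, 13.24]
population_mean = 13.0


def one_sample_t(values, mu0):
    """Return the one-sample t-statistic and its degrees of freedom."""
    n = len(values)
    sample_mean = statistics.fmean(values)
    # Standard error of the mean: sample std (n - 1 denominator) over sqrt(n)
    standard_error = statistics.stdev(values) / math.sqrt(n)
    t_stat = (sample_mean - mu0) / standard_error
    return t_stat, n - 1


t_stat, dof = one_sample_t(alcohol_head, population_mean)
print("T-Statistic:", round(t_stat, 4))
print("Degrees of freedom:", dof)
```

Turning the statistic into a p-value needs the Student's t distribution with dof degrees of freedom, which the standard library does not provide; that step is what stats.ttest_1samp takes care of.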


### 4) Confidence Intervals

- Import Libraries: Import numpy and scipy for calculations.
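- Sample Mean & Standard Error: Calculate the sample mean (sample_mean) and its standard error, an estimate of how much the sample mean would vary from one sample to another.
- Confidence Interval: Compute a 95% confidence interval for the true mean, which conveys the uncertainty around the sample mean. The code uses stats.norm.interval, i.e. a normal approximation, which is reasonable for a sample of 178 observations; for small samples a t-based interval is the usual choice.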

In [43]:
import numpy as np
from scipy import stats

# Sample mean and standard error for the selected feature (alcohol)
sample_mean = np.mean(alcohol_values)
standard_error = stats.sem(alcohol_values)

# Compute the 95% confidence interval for the alcohol content
confidence_interval = stats.norm.interval(0.95, loc=sample_mean, scale=standard_error)

print("95% Confidence Interval for Alcohol Content:", confidence_interval)


95% Confidence Interval for Alcohol Content: (12.881356184655965, 13.119879770400216)
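
As a sanity check on the interval above, the sketch below uses only the standard library and numbers already printed in this notebook. It assumes the interval was built as the sample mean plus or minus z times the standard error (which is what a normal-based interval amounts to), recovers the standard error from the printed bounds, and then recomputes the t-statistic from section 3. All variable names are illustrative.

```python
from statistics import NormalDist

# Values copied from the notebook output
ci_low, ci_high = 12.881356184655965, 13.119879770400216
population_mean = 13.0
reported_t = 0.01015592394800969

# Two-sided 95% critical value of the standard normal (about 1.96)
z = NormalDist().inv_cdf(0.975)

# For a symmetric interval, the midpoint is the sample mean
sample_mean = (ci_low + ci_high) / 2
# Half the interval width equals z * standard error
standard_error = (ci_high - ci_low) / (2 * z)

# Rebuild the interval and the one-sample t-statistic from these two numbers
rebuilt = (sample_mean - z * standard_error, sample_mean + z * standard_error)
t_stat = (sample_mean - population_mean) / standard_error

print("Sample mean:", round(sample_mean, 6))
print("Standard error:", round(standard_error, 6))
print("Rebuilt interval:", rebuilt)
print("Recomputed t-statistic:", t_stat)
print("Reported t-statistic:  ", reported_t)
```

The recomputed statistic should agree closely with the reported one, because the t-test and the interval rely on the same standard error.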


### 5) Regression Analysis

- Import Libraries: Import statsmodels for regression analysis.
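- Prepare Data: X is the independent variable (e.g., 'alcohol') with a constant added for the intercept. y is the dependent variable, here the wine category label (0, 1 or 2) treated as a number.
- Fit Model: Perform ordinary least squares (OLS) linear regression to model the relationship between the independent and dependent variables. Because the target is a class label rather than a continuous quantity, read the fit as an illustration of the API, not as a classification model.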
- Print Summary: Show details about the model, including how well it fits the data and the significance of the relationships.

In [44]:
import statsmodels.api as sm

# Define independent variable (add constant for intercept);
# 'alcohol' can be swapped for any other feature from the wine dataset
X = sm.add_constant(df['alcohol'])

# Define dependent variable (e.g., target variable in the wine dataset)
y = wine.target

# Fit linear regression model
model = sm.OLS(y, X).fit()

# Print model summary
print(model.summary())


                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.108
Model:                            OLS   Adj. R-squared:                  0.103
Method:                 Least Squares   F-statistic:                     21.25
Date:                Wed, 04 Sep 2024   Prob (F-statistic):           7.72e-06
Time:                        15:02:34   Log-Likelihood:                -196.56
No. Observations:                 178   AIC:                             397.1
Df Residuals:                     176   BIC:                             403.5
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          5.0119      0.885      5.660      0.0
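
For a single predictor, the least-squares fit behind the summary above has a closed form: the slope is the covariance of x and y divided by the variance of x, and the intercept makes the line pass through the point of means. The sketch below illustrates this with the standard library on made-up toy data (x_toy and y_toy are hypothetical, not taken from the wine dataset; simple_ols is a hypothetical helper). It then checks one relation against the printed summary: with one predictor, R-squared can be recovered from the F-statistic and the residual degrees of freedom.

```python
def simple_ols(x, y):
    """Closed-form least squares for y = intercept + slope * x."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_mean) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = y_mean - slope * x_mean
    # R-squared: share of the variance in y explained by the fitted line
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_mean) ** 2 for yi in y)
    return intercept, slope, 1 - ss_res / ss_tot


# Hypothetical toy data, chosen only to illustrate the arithmetic
x_toy = [1.0, 2.0, 3.0, 4.0, 5.0]
y_toy = [2.1, 3.9, 6.2, 7.8, 10.1]
intercept, slope, r_squared = simple_ols(x_toy, y_toy)
print("Intercept:", round(intercept, 4))
print("Slope:", round(slope, 4))
print("R-squared:", round(r_squared, 4))

# Values copied from the OLS summary above (Df Model = 1)
f_statistic = 21.25
df_residuals = 176
# From F = R^2 / ((1 - R^2) / df_residuals), solved for R^2
r_squared_from_f = f_statistic / (f_statistic + df_residuals)
print("R-squared implied by F:", round(r_squared_from_f, 3))
```

The implied value rounds to the R-squared shown in the summary, so alcohol alone explains only about a tenth of the variation in the numeric class label.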