<a href="https://colab.research.google.com/github/NookaNaveenth/statistical-analysis/blob/main/naveen18.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Introduction**

Statistical analysis is vital for data science and machine learning, helping to extract insights, predict outcomes, and make informed decisions. This session covers fundamental statistical methods using Python, with libraries such as pandas, numpy, and scipy, and applies these methods to datasets from sklearn, Kaggle, and the UCI Machine Learning Repository.



**Section 1: Basic Concepts in Statistics**

**Population and Sample**

- **Population**: The complete set of observations (e.g., all adults in a city).
- **Sample**: A subset of the population (e.g., 500 randomly selected adults) used for practical reasons like cost and time.
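
The relationship between a population and a sample can be sketched with simulated data. The population below is hypothetical (100,000 simulated heights generated with NumPy, not real measurements); the point is only that a simple random sample of 500 tends to have a mean close to the population mean.

```python
import numpy as np

# Hypothetical population: 100,000 simulated adult heights in cm (illustrative values only)
rng = np.random.default_rng(seed=42)
population = rng.normal(loc=170.0, scale=10.0, size=100_000)

# Simple random sample of 500 individuals, drawn without replacement
sample = rng.choice(population, size=500, replace=False)

population_mean = population.mean()
sample_mean = sample.mean()

print(f"Population mean: {population_mean:.2f}")
print(f"Sample mean:     {sample_mean:.2f}")
print(f"Difference:      {abs(population_mean - sample_mean):.2f}")
```

Re-running with a different seed gives a slightly different sample mean, which is the sampling variability that inferential statistics tries to quantify.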

**Need for Sampling**

- **Cost Efficiency**: Reduces data collection costs.
- **Time Efficiency**: Provides quicker insights.
- **Feasibility**: Practical when studying the entire population isn't possible.
- **Manageability**: Simplifies data analysis.

**Benefits of Sampling**

- **Accuracy**: Proper sampling techniques, such as random sampling, help sample results reflect the population.
- **Less Data Overhead**: Easier to analyze compared to large datasets.
- **Focused Research**: Enables detailed analysis.

**Section 2: Understanding Basic Statistical Methods**

**Descriptive Statistics**

- **Mean**: Average value.
- **Median**: Middle value.
- **Mode**: Most frequent value.
- **Standard Deviation (SD)**: Dispersion around the mean, equal to the square root of the variance.
- **Variance**: Average squared deviation from the mean.
- **Range**: Difference between maximum and minimum values.
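
As a small worked example of the measures listed above, the sketch below uses Python's built-in `statistics` module on a made-up list of seven values (the data are illustrative, not from any dataset used in this session).

```python
import statistics

# Small illustrative dataset (made-up values)
data = [2, 4, 4, 5, 7, 9, 11]

mean = statistics.mean(data)          # sum of values divided by the number of values
median = statistics.median(data)      # middle value of the sorted data
mode = statistics.mode(data)          # most frequent value
variance = statistics.variance(data)  # sample variance (divides by n - 1)
std_dev = statistics.stdev(data)      # square root of the sample variance
value_range = max(data) - min(data)   # maximum minus minimum

print(f"Mean: {mean}, Median: {median}, Mode: {mode}")
print(f"Variance: {variance}, SD: {std_dev:.4f}, Range: {value_range}")
```

Note that `statistics.variance` and `statistics.stdev` compute the sample versions (dividing by n - 1), which is also the default used by `DataFrame.var()` and `DataFrame.std()` in pandas.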

**Additional Measures:**

- **Percentiles and Quartiles**: Values below which a given percentage of observations fall; the quartiles are the 25th, 50th, and 75th percentiles.
- **Interquartile Range (IQR)**: Range of the middle 50% of values.
- **Skewness**: Asymmetry of data distribution.
- **Kurtosis**: Tailedness of the distribution.
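
These additional measures can be illustrated with NumPy and SciPy. The array below is made up and deliberately includes one extreme value, so the skewness comes out positive and the common 1.5 × IQR rule of thumb flags the outlier. This is a minimal sketch, not the only way to define outliers.

```python
import numpy as np
from scipy import stats

# Illustrative data with one extreme value (made-up numbers)
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)

# Quartiles are the 25th, 50th, and 75th percentiles
q1, q2, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1

# Rule of thumb: points beyond 1.5 * IQR from the quartiles are flagged as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = data[(data < lower_fence) | (data > upper_fence)]

skewness = stats.skew(data)             # > 0 means a longer right tail
excess_kurtosis = stats.kurtosis(data)  # Fisher definition: a normal distribution gives 0

print(f"Q1={q1}, median={q2}, Q3={q3}, IQR={iqr}")
print(f"Outliers: {outliers}")
print(f"Skewness: {skewness:.3f}, Excess kurtosis: {excess_kurtosis:.3f}")
```

SciPy's defaults use the biased estimators, whereas pandas' `skew()` and `kurt()` apply a bias correction, so the two libraries can report slightly different values for the same data.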

**Inferential Statistics**

- **Hypothesis Testing**: Tests hypotheses about population parameters.
- **Confidence Intervals**: A range of values, computed from the sample, that is likely to contain the true population parameter at a stated confidence level (e.g., 95%).
- **Regression Analysis**: Models relationships between variables.
- **ANOVA**: Compares means of multiple groups.
- **Chi-Square Test**: Tests independence of categorical variables.
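
ANOVA and the chi-square test can be sketched with `scipy.stats`. The three groups and the 2x2 contingency table below are hypothetical, generated or typed in purely for illustration.

```python
import numpy as np
from scipy import stats

# One-way ANOVA: do three hypothetical groups share the same mean?
rng = np.random.default_rng(seed=0)
group_a = rng.normal(loc=5.0, scale=1.0, size=30)
group_b = rng.normal(loc=5.5, scale=1.0, size=30)
group_c = rng.normal(loc=7.0, scale=1.0, size=30)
f_stat, anova_p = stats.f_oneway(group_a, group_b, group_c)
print(f"ANOVA: F = {f_stat:.2f}, p = {anova_p:.3g}")

# Chi-square test of independence on a made-up 2x2 contingency table
# (rows: two hypothetical groups, columns: two hypothetical outcomes)
observed = np.array([[30, 10],
                     [20, 40]])
chi2, chi2_p, dof, expected = stats.chi2_contingency(observed)
print(f"Chi-square: chi2 = {chi2:.2f}, p = {chi2_p:.3g}, dof = {dof}")
print("Expected counts under independence:\n", expected)
```

A small p-value in either test suggests rejecting the null hypothesis (equal group means for ANOVA, independence for the chi-square test) at the chosen significance level.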



**Section 3: Getting Started with Python for Statistical Analysis**

Install necessary libraries with:

```bash
pip install pandas numpy scipy scikit-learn statsmodels
```
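
After installation, a quick check like the following can confirm which libraries are importable. It is a minimal sketch that only uses the standard library; note that scikit-learn is imported under the name `sklearn`.

```python
import importlib.util

# Import names for the packages installed above (scikit-learn imports as "sklearn")
modules = ["pandas", "numpy", "scipy", "sklearn", "statsmodels"]

# find_spec returns None when a top-level module cannot be found
status = {name: importlib.util.find_spec(name) is not None for name in modules}

for name, installed in status.items():
    print(f"{name:12s} {'OK' if installed else 'MISSING'}")
```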



**Section 4: Applying Statistical Methods to Datasets**

**Descriptive Statistics in Python**

Use Python to compute measures like mean, median, and standard deviation to describe dataset features.

**Inferential Statistics in Python**

Apply techniques such as hypothesis testing and regression analysis to draw conclusions beyond the sample data.



**Section 5: Exercises**

**Exercise 1: Health-Related Dataset**

1. Select a dataset from Kaggle or UCI.
2. Load and analyze it using pandas.
3. Calculate mean, median, mode, SD, and variance.
4. Perform a hypothesis test.
5. Compute a 95% confidence interval for a feature.

**Exercise 2: Regression Analysis**

1. Conduct a linear regression analysis on the dataset.
2. Interpret coefficients, p-values, and R-squared.
3. Visualize relationships and regression lines.

**Section 6: Sample Code**

To help you get started, here is a set of code examples that illustrate how to perform descriptive and inferential statistics using Python.



**Loading the Iris Dataset**

In [None]:
import pandas as pd
from sklearn.datasets import load_iris

# Load the dataset
iris = load_iris()
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['target'] = iris.target

# Display the first few rows
print(df.head())


   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
0                5.1               3.5                1.4               0.2   
1                4.9               3.0                1.4               0.2   
2                4.7               3.2                1.3               0.2   
3                4.6               3.1                1.5               0.2   
4                5.0               3.6                1.4               0.2   

   target  
0       0  
1       0  
2       0  
3       0  
4       0  


**Performing Descriptive Statistics**

In [None]:
# Calculate basic descriptive statistics
print("Mean:\n", df.mean())
print("\nMedian:\n", df.median())
print("\nMode:\n", df.mode().iloc[0])
print("\nStandard Deviation:\n", df.std())
print("\nVariance:\n", df.var())

# Additional descriptive statistics
print("\nRange:\n", df.max() - df.min())
print("\nSkewness:\n", df.skew())
print("\nKurtosis:\n", df.kurt())



Mean:
 sepal length (cm)    5.843333
sepal width (cm)     3.057333
petal length (cm)    3.758000
petal width (cm)     1.199333
target               1.000000
dtype: float64

Median:
 sepal length (cm)    5.80
sepal width (cm)     3.00
petal length (cm)    4.35
petal width (cm)     1.30
target               1.00
dtype: float64

Mode:
 sepal length (cm)    5.0
sepal width (cm)     3.0
petal length (cm)    1.4
petal width (cm)     0.2
target               0.0
Name: 0, dtype: float64

Standard Deviation:
 sepal length (cm)    0.828066
sepal width (cm)     0.435866
petal length (cm)    1.765298
petal width (cm)     0.762238
target               0.819232
dtype: float64

Variance:
 sepal length (cm)    0.685694
sepal width (cm)     0.189979
petal length (cm)    3.116278
petal width (cm)     0.581006
target               0.671141
dtype: float64

Range:
 sepal length (cm)    3.6
sepal width (cm)     2.4
petal length (cm)    5.9
petal width (cm)     2.4
target               2.0
dtype: float64

Sk

**Performing Inferential Statistics**

In [None]:
from scipy import stats
import pandas as pd
from sklearn.datasets import load_iris

# Load the dataset
iris = load_iris()
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['target'] = iris.target

# Example data: Sepal length values
sepal_length_values = df['sepal length (cm)']

# Hypothetical population mean for Sepal length
population_mean = 5.0

# Perform one-sample t-test
t_stat, p_value = stats.ttest_1samp(sepal_length_values, population_mean)

print(f"T-Statistic: {t_stat}")
print(f"P-Value: {p_value}")


T-Statistic: 12.473257146694761
P-Value: 6.670742299801927e-25
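
To see where the t-statistic comes from, the sketch below recomputes it by hand from the summary values printed earlier for sepal length (mean 5.843333, standard deviation 0.828066, 150 observations) using t = (sample mean - hypothesized mean) / (SD / sqrt(n)). The numbers are copied from the outputs above rather than recomputed from the dataset, so small rounding differences are expected.

```python
import math
from scipy import stats

# Summary values copied from the descriptive statistics output for sepal length
sample_mean = 5.843333
sample_sd = 0.828066    # sample standard deviation (n - 1 in the denominator)
n = 150                 # number of observations in the Iris dataset
population_mean = 5.0   # hypothesized mean used in the one-sample t-test

# Standard error of the mean and the resulting t-statistic
standard_error = sample_sd / math.sqrt(n)
t_stat = (sample_mean - population_mean) / standard_error

# Two-sided p-value from the t distribution with n - 1 degrees of freedom
dof = n - 1
p_value = 2 * stats.t.sf(abs(t_stat), dof)

print(f"Standard error: {standard_error:.6f}")
print(f"T-statistic:    {t_stat:.4f}")
print(f"P-value:        {p_value:.3g}")
```

Because the p-value is far below 0.05, the test rejects the hypothesis that the mean sepal length is 5.0 cm.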


**Confidence Intervals**

In [None]:
import numpy as np
from scipy import stats
from sklearn.datasets import load_iris
import pandas as pd

# Load the dataset
iris = load_iris()
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)

# Example data: Sepal length values
sepal_length_values = df['sepal length (cm)']

# Sample mean and standard error for Sepal length
sample_mean = np.mean(sepal_length_values)
standard_error = stats.sem(sepal_length_values)

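# Compute 95% confidence interval for Sepal length using the normal approximation
# (reasonable for n = 150; a t-based interval would be slightly wider)
confidence_interval = stats.norm.interval(0.95, loc=sample_mean, scale=standard_error)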

print(f"95% Confidence Interval for Sepal Length: {confidence_interval}")


95% Confidence Interval for Sepal Length: (5.710817588579892, 5.9758490780867755)
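
When the population standard deviation is estimated from the sample, a t-based interval is a common alternative to the normal-based one. The sketch below compares the two using the summary values printed earlier (mean 5.843333, standard deviation 0.828066, n = 150); it is an illustration under those copied values, not a replacement for the cell above.

```python
import math
from scipy import stats

# Summary values copied from the descriptive statistics output for sepal length
sample_mean = 5.843333
sample_sd = 0.828066
n = 150
standard_error = sample_sd / math.sqrt(n)

# Normal-based interval (same approach as the cell above)
normal_ci = stats.norm.interval(0.95, loc=sample_mean, scale=standard_error)

# t-based interval with n - 1 degrees of freedom
t_ci = stats.t.interval(0.95, n - 1, loc=sample_mean, scale=standard_error)

print(f"Normal-based 95% CI: ({normal_ci[0]:.4f}, {normal_ci[1]:.4f})")
print(f"t-based 95% CI:      ({t_ci[0]:.4f}, {t_ci[1]:.4f})")
```

With 150 observations the two intervals are nearly identical; the difference matters more for small samples, where the t distribution has noticeably heavier tails.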


**Regression Analysis**

In [None]:
import statsmodels.api as sm
from sklearn.datasets import load_iris
import pandas as pd

# Load the dataset
iris = load_iris()
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['target'] = iris.target

# Define independent variable (add constant for intercept)
X = sm.add_constant(df['sepal length (cm)'])

# Define dependent variable
# Note: 'target' is a class label (0, 1, 2), treated as numeric here for illustration only
y = df['target']

# Fit linear regression model
model = sm.OLS(y, X).fit()

# Print model summary
print(model.summary())


                            OLS Regression Results                            
Dep. Variable:                 target   R-squared:                       0.612
Model:                            OLS   Adj. R-squared:                  0.610
Method:                 Least Squares   F-statistic:                     233.8
Date:                Tue, 03 Sep 2024   Prob (F-statistic):           2.89e-32
Time:                        05:24:52   Log-Likelihood:                -111.35
No. Observations:                 150   AIC:                             226.7
Df Residuals:                     148   BIC:                             232.7
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                        coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------
const                -3.5240      0.29
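
To practice reading the slope, intercept, p-value, and R-squared without scrolling through a full summary table, the sketch below fits a simple linear regression with `scipy.stats.linregress` (a lighter alternative to the statsmodels `OLS` call above, swapped in here for brevity). The data are synthetic, generated from a known line y = 2x + 1 plus noise, so the estimates can be compared with the true values.

```python
import numpy as np
from scipy import stats

# Synthetic data from a known relationship: y = 2x + 1 + noise (illustrative only)
rng = np.random.default_rng(seed=1)
x = rng.uniform(0, 10, size=200)
y = 2.0 * x + 1.0 + rng.normal(scale=1.0, size=200)

# Ordinary least squares fit of y on x
result = stats.linregress(x, y)
r_squared = result.rvalue ** 2

print(f"Slope:     {result.slope:.3f}  (change in y per unit change in x)")
print(f"Intercept: {result.intercept:.3f}  (predicted y when x = 0)")
print(f"P-value:   {result.pvalue:.3g}  (tests the null hypothesis that the slope is 0)")
print(f"R-squared: {r_squared:.3f}  (share of variance in y explained by x)")
```

The estimated slope and intercept should land close to 2 and 1, and the high R-squared reflects that most of the variation in y comes from x rather than from the noise term.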

**Conclusion**

This article has provided a thorough introduction to statistical analysis using Python, covering both descriptive and inferential statistics. Understanding these fundamental concepts allows you to effectively summarize data, uncover patterns, and make informed predictions with real-world datasets.

We started by discussing key concepts such as populations, samples, and the importance of sampling. These concepts are crucial for generalizing findings from a sample to a larger population and ensuring that your analyses are accurate and meaningful.

Next, we explored descriptive statistics, which are essential for summarizing and interpreting data. Measures such as the mean, median, mode, standard deviation, variance, and range offer insights into the central tendency, spread, and distribution of data. Additional metrics like percentiles, skewness, and kurtosis provide deeper insights into data characteristics, helping identify trends, outliers, and patterns.

We then covered inferential statistics, which allow for making predictions and drawing conclusions beyond the sample data. Techniques such as hypothesis testing, confidence intervals, regression analysis, ANOVA, and chi-square tests are fundamental for evaluating theories, testing assumptions, and understanding variable relationships. These methods are vital for making data-driven decisions and advancing research.

By leveraging Python libraries like pandas, numpy, scipy, and statsmodels, you can perform comprehensive analyses on real-world datasets. The exercises in this article, involving datasets from Kaggle and the UCI Machine Learning Repository, will further enhance your practical skills and understanding of statistical methods.

As you continue to develop your skills, you'll be better prepared to tackle complex data science challenges, make well-informed decisions, and contribute to impactful research and analysis. Mastering these statistical techniques will significantly enhance your proficiency as a data scientist or researcher.

Keep exploring and experimenting to fully unlock the potential of statistical analysis in your work.