## Data Wrangling with Pandas and Statistical Analysis with SciPy

## Part 2 - Statistical Analysis with SciPy

In this section, we cover the basics of statistical analysis using SciPy, a Python library for scientific and technical computing. We start with an overview of SciPy, followed by a hands-on introduction to basic statistical analysis techniques: descriptive statistics, t-tests, correlation, and regression analysis.

## Introduction to Statistical Analysis with SciPy

### Overview of SciPy
SciPy builds on NumPy and provides a large collection of functions for numerical integration, optimization, statistics, signal processing, and more. Its routines operate directly on NumPy arrays, which makes it well suited to numerical work on large datasets.

### Key Submodules
- **`scipy.stats`:** This module contains a wide range of statistical functions for descriptive and inferential statistics.
- **`scipy.optimize`:** Used for optimization algorithms, including finding the minimum or maximum of a function.
- **`scipy.integrate`:** Provides methods for performing numerical integration.
- **`scipy.interpolate`:** Contains tools for interpolation (estimating data points within the range of a discrete set of known data points).
- **`scipy.spatial`:** Provides functions for working with spatial data structures and algorithms.
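
To make the list above concrete, here is a minimal sketch that calls one function from each submodule. The toy inputs (a parabola, three sample points, a 3-4-5 triangle) are illustrative assumptions, not part of the original material.

```python
import numpy as np
from scipy import integrate, interpolate, optimize, stats
from scipy.spatial import distance

# scipy.stats: probability that a standard normal variable is below 0
prob = stats.norm.cdf(0.0)

# scipy.optimize: find the x that minimizes (x - 2)^2
result = optimize.minimize_scalar(lambda x: (x - 2) ** 2)

# scipy.integrate: integrate x^2 from 0 to 3 (exact value is 9)
area, abs_error = integrate.quad(lambda x: x ** 2, 0, 3)

# scipy.interpolate: linear interpolation between known points
f = interpolate.interp1d([0, 1, 2], [0, 10, 20])
midpoint = float(f(0.5))

# scipy.spatial: Euclidean distance between two points
dist = distance.euclidean([0, 0], [3, 4])

print("P(Z < 0):", prob)
print("Minimizer:", result.x)
print("Integral:", area)
print("Interpolated value at 0.5:", midpoint)
print("Distance:", dist)
```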

### Basic Setup
Before using SciPy, ensure that it is installed in your environment. You can install it via pip:

```bash
pip install scipy
```

In [1]:
import scipy.stats as stats
import scipy.optimize as optimize
import scipy.integrate as integrate


### Basic Statistical Analysis with SciPy

Here we explore some fundamental statistical analysis techniques using SciPy, including descriptive statistics, t-tests, correlation, and regression analysis.

**Descriptive Statistics**

In [5]:
import numpy as np
import scipy.stats as stats

# Sample data: Heights of a group of individuals (in cm)
data = np.array([167, 170, 169, 171, 173, 174, 168, 172, 175, 169, 120, 180, 178, 172, 171, 169, 173, 172, 171, 170])

# Calculate descriptive statistics
mean = np.mean(data)
median = np.median(data)
mode_result = stats.mode(data)
variance = np.var(data)
std_dev = np.std(data)

# Access mode and count safely
mode_value = mode_result.mode[0] if isinstance(mode_result.mode, np.ndarray) else mode_result.mode
mode_count = mode_result.count[0] if isinstance(mode_result.count, np.ndarray) else mode_result.count

print("Mean:", mean)
print("Median:", median)
print("Mode:", mode_value, "with frequency", mode_count)
print("Variance:", variance)
print("Standard Deviation:", std_dev)

Mean: 169.2
Median: 171.0
Mode: 169 with frequency 3
Variance: 137.06
Standard Deviation: 11.707262703125782
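
Two details of the example above are worth a closer look. `np.var` and `np.std` default to the population formulas (`ddof=0`, dividing by n), and the single value of 120 cm pulls the mean and variance noticeably. The sketch below reuses the same data to compare the population and sample variance and to show two outlier-resistant summaries. The choice of a 10% trim is an assumption made for illustration.

```python
import numpy as np
from scipy import stats

# Same height data as in the example above (in cm); note the outlier at 120
data = np.array([167, 170, 169, 171, 173, 174, 168, 172, 175, 169,
                 120, 180, 178, 172, 171, 169, 173, 172, 171, 170])

# Population variance: divides by n (NumPy's default, ddof=0)
population_var = np.var(data)

# Sample variance: divides by n - 1 (ddof=1)
sample_var = np.var(data, ddof=1)

# stats.describe reports the ddof=1 variance by default
summary = stats.describe(data)

# Summaries that are less sensitive to the outlier
trimmed_mean = stats.trim_mean(data, proportiontocut=0.1)  # drops 2 values from each end
iqr = stats.iqr(data)  # 75th percentile minus 25th percentile

print("Population variance:", population_var)
print("Sample variance:", sample_var)
print("describe() variance:", summary.variance)
print("Min and max:", summary.minmax)
print("10% trimmed mean:", trimmed_mean)
print("Interquartile range:", iqr)
```

Pass `ddof=1` whenever the data are a sample from a larger population, which is the usual case in inferential statistics.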


**Descriptive Statistics in Higher-Dimensional Data**

In [18]:
import numpy as np
import scipy.stats as stats
import pandas as pd

# List of 15 Nepali cities
cities = [
    "Kathmandu", "Pokhara", "Biratnagar", "Birgunj", "Lalitpur", 
    "Bharatpur", "Hetauda", "Janakpur", "Dharan", "Butwal", 
    "Mahendranagar", "Dhangadhi", "Nepalgunj", "Ghorahi", "Tulsipur"
]

# Generate a dataset of temperatures (in °C) for 15 days across 15 cities
np.random.seed(42)  # For reproducibility
temperatures = {city: np.round(np.random.normal(loc=25, scale=3, size=15), 1) for city in cities}  # Mean=25°C, SD=3°C

# Create a DataFrame to hold the data
df_temperatures = pd.DataFrame(temperatures, index=pd.date_range(start='2024-08-01', periods=15))

# Display the first few rows of the DataFrame
print("Temperature Data for 15 Cities Over 15 Days:")
print(df_temperatures.head())



Temperature Data for 15 Cities Over 15 Days:
            Kathmandu  Pokhara  Biratnagar  Birgunj  Lalitpur  Bharatpur  \
2024-08-01       26.5     23.3        23.2     22.8      23.6       27.5   
2024-08-02       24.6     22.0        30.6     23.6      24.4       25.3   
2024-08-03       26.9     25.9        25.0     28.2      21.7       24.1   
2024-08-04       29.6     22.3        21.8     26.0      21.4       25.3   
2024-08-05       24.3     20.8        27.5     19.7      27.4       19.0   

            Hetauda  Janakpur  Dharan  Butwal  Mahendranagar  Dhangadhi  \
2024-08-01     25.3      26.2    27.4    29.6           25.8       26.2   
2024-08-02     27.9      30.7    22.3    22.7           26.0       27.5   
2024-08-03     22.9      25.5    29.2    24.0           23.0       30.7   
2024-08-04     24.0      25.8    20.8    27.4           25.7       24.3   
2024-08-05     23.8      24.8    26.8    21.3           25.9       22.7   

            Nepalgunj  Ghorahi  Tulsipur  
2024

In [19]:
# Calculate descriptive statistics for one of the cities, e.g., Kathmandu
kathmandu_temps = df_temperatures["Kathmandu"]

mean = np.mean(kathmandu_temps)
median = np.median(kathmandu_temps)
mode_result = stats.mode(kathmandu_temps)
variance = np.var(kathmandu_temps)
std_dev = np.std(kathmandu_temps)

# Access mode and count safely
mode_value = mode_result.mode[0] if isinstance(mode_result.mode, np.ndarray) else mode_result.mode
mode_count = mode_result.count[0] if isinstance(mode_result.count, np.ndarray) else mode_result.count

print("\nDescriptive Statistics for Kathmandu:")
print("Mean:", mean)
print("Median:", median)
print("Mode:", mode_value, "with frequency", mode_count)
print("Variance:", variance)
print("Standard Deviation:", std_dev)


Descriptive Statistics for Kathmandu:
Mean: 25.026666666666667
Median: 24.6
Mode: 23.6 with frequency 3
Variance: 8.265955555555554
Standard Deviation: 2.875057487347958


In [20]:
# Calculate descriptive statistics for one of the cities, e.g., Butwal
butwal_temps = df_temperatures["Butwal"]

mean = np.mean(butwal_temps)
median = np.median(butwal_temps)
mode_result = stats.mode(butwal_temps)
variance = np.var(butwal_temps)
std_dev = np.std(butwal_temps)

# Access mode and count safely
mode_value = mode_result.mode[0] if isinstance(mode_result.mode, np.ndarray) else mode_result.mode
mode_count = mode_result.count[0] if isinstance(mode_result.count, np.ndarray) else mode_result.count

print("\nDescriptive Statistics for Butwal:")
print("Mean:", mean)
print("Median:", median)
print("Mode:", mode_value, "with frequency", mode_count)
print("Variance:", variance)
print("Standard Deviation:", std_dev)


Descriptive Statistics for Butwal:
Mean: 24.886666666666667
Median: 25.7
Mode: 21.3 with frequency 2
Variance: 8.319822222222221
Standard Deviation: 2.8844102035290025


## Inferential Statistics: T-Tests


In this section, we use SciPy to perform t-tests, which assess whether an observed difference between means is statistically significant. The most common use is comparing the means of two groups.

### Types of T-Tests:

1. **Independent T-Test (`ttest_ind`)**:
   - Used to compare the means of two independent groups.
   - Example: Comparing the average temperature of two different cities over a period.
2. **Paired T-Test (`ttest_rel`)**:
   - Used to compare means from the same group at different times (paired data).
   - Example: Comparing the average temperature of the same city between two different weeks.
3. **One-Sample T-Test (`ttest_1samp`)**:
   - Used to test if the mean of a single group is different from a known value.
   - Example: Testing if the average temperature in a city is different from a historical average.
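
The paired and one-sample variants from the list above can be sketched briefly. The two weeks of temperatures, the random seed, and the historical average of 25°C are all hypothetical values chosen for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only

# Hypothetical daily temperatures (°C) for the same city in two different weeks
week1 = np.round(rng.normal(loc=25, scale=3, size=7), 1)
week2 = np.round(week1 + rng.normal(loc=0.5, scale=1, size=7), 1)

# Paired t-test: compares the two weeks day by day
t_rel, p_rel = stats.ttest_rel(week1, week2)

# One-sample t-test: compares the observed mean with an assumed historical average
historical_average = 25.0
both_weeks = np.concatenate([week1, week2])
t_one, p_one = stats.ttest_1samp(both_weeks, popmean=historical_average)

# A paired t-test is equivalent to a one-sample t-test on the differences
t_diff, p_diff = stats.ttest_1samp(week1 - week2, popmean=0.0)

print("Paired t-test:      t =", t_rel, " p =", p_rel)
print("One-sample t-test:  t =", t_one, " p =", p_one)
print("Test on differences: t =", t_diff, " p =", p_diff)
```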

#### Example: Independent T-Test

Let’s perform an independent t-test to compare the mean temperatures of two different cities over a 15-day period.

In [25]:
import numpy as np
import scipy.stats as stats
import pandas as pd

# Generate a dataset of temperatures (in °C) for 15 days for Kathmandu and Pokhara, rounded to 1 decimal place
np.random.seed(42)  # For reproducibility
temperatures = {
    "Kathmandu": np.round(np.random.normal(loc=25, scale=3, size=15), 1),
    "Pokhara": np.round(np.random.normal(loc=26, scale=3, size=15), 1)
}

# Create a DataFrame to hold the data
df_temperatures = pd.DataFrame(temperatures, index=pd.date_range(start='2024-08-01', periods=15))

# Display the temperature data for Kathmandu and Pokhara
print("Temperature Data for Kathmandu and Pokhara:")
print(df_temperatures)

# Perform an independent t-test
t_stat, p_value = stats.ttest_ind(df_temperatures["Kathmandu"], df_temperatures["Pokhara"])

print("\nIndependent T-Test between Kathmandu and Pokhara:")
print("T-Statistic:", t_stat)
print("P-Value:", p_value)

# Interpretation
if p_value < 0.05:
    print("Result: There is a significant difference in the mean temperatures between Kathmandu and Pokhara.")
else:
    print("Result: There is no significant difference in the mean temperatures between Kathmandu and Pokhara.")

Temperature Data for Kathmandu and Pokhara:
            Kathmandu  Pokhara
2024-08-01       26.5     24.3
2024-08-02       24.6     23.0
2024-08-03       26.9     26.9
2024-08-04       29.6     23.3
2024-08-05       24.3     21.8
2024-08-06       24.3     30.4
2024-08-07       29.7     25.3
2024-08-08       27.3     26.2
2024-08-09       23.6     21.7
2024-08-10       26.6     24.4
2024-08-11       23.6     26.3
2024-08-12       23.6     22.5
2024-08-13       25.7     27.1
2024-08-14       19.3     24.2
2024-08-15       19.8     25.1

Independent T-Test between Kathmandu and Pokhara:
T-Statistic: 0.1982352872256718
P-Value: 0.8442932578984743
Result: There is no significant difference in the mean temperatures between Kathmandu and Pokhara.


#### Explanation:

- **Dataset Generation**:
  - We generate temperature data for Kathmandu and Pokhara over 15 days.
  - The temperatures are simulated using a normal distribution with a mean of 25°C for Kathmandu and 26°C for Pokhara, both with a standard deviation of 3°C.
- **Independent T-Test (`ttest_ind`)**:
  - The `ttest_ind()` function from SciPy compares the means of the two independent groups (Kathmandu and Pokhara). By default it assumes the two groups have equal variances.
  - The t-statistic expresses the difference between the two sample means in units of the standard error of that difference.
  - The p-value is the probability of observing a t-statistic at least as extreme as the one computed, assuming the null hypothesis (that the means are equal) is true.
- **Interpretation**:
  - If the p-value is less than 0.05, we reject the null hypothesis and conclude that there is a significant difference between the mean temperatures of the two cities.
  - If the p-value is 0.05 or greater, we fail to reject the null hypothesis: the data do not show a significant difference between the means. This is not proof that the means are equal. Here the data were simulated with means that differ by 1°C, yet with only 15 observations per city and a p-value of about 0.84, the test does not detect the difference.
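
To see where the t-statistic and p-value come from, the following sketch recomputes them by hand from the temperatures printed above, using the pooled-variance formula, and compares the result with `ttest_ind`. The last call uses `equal_var=False` (Welch's t-test), a variant not used in the original example, which drops the equal-variance assumption.

```python
import numpy as np
from scipy import stats

# Temperatures copied from the output table above
kathmandu = np.array([26.5, 24.6, 26.9, 29.6, 24.3, 24.3, 29.7, 27.3,
                      23.6, 26.6, 23.6, 23.6, 25.7, 19.3, 19.8])
pokhara = np.array([24.3, 23.0, 26.9, 23.3, 21.8, 30.4, 25.3, 26.2,
                    21.7, 24.4, 26.3, 22.5, 27.1, 24.2, 25.1])

n1, n2 = len(kathmandu), len(pokhara)
dof = n1 + n2 - 2  # degrees of freedom

# Pooled sample variance (each group's variance uses ddof=1)
pooled_var = ((n1 - 1) * kathmandu.var(ddof=1) + (n2 - 1) * pokhara.var(ddof=1)) / dof

# Standard error of the difference between the two means
std_error = np.sqrt(pooled_var * (1 / n1 + 1 / n2))

# t-statistic: difference in means measured in standard errors
t_manual = (kathmandu.mean() - pokhara.mean()) / std_error

# Two-sided p-value from the t distribution
p_manual = 2 * stats.t.sf(abs(t_manual), dof)

t_scipy, p_scipy = stats.ttest_ind(kathmandu, pokhara)
t_welch, p_welch = stats.ttest_ind(kathmandu, pokhara, equal_var=False)

print("Manual: t =", t_manual, " p =", p_manual)
print("SciPy:  t =", t_scipy, " p =", p_scipy)
print("Welch:  t =", t_welch, " p =", p_welch)
```

Because both groups have the same size, Welch's t-statistic equals the pooled one here; only the degrees of freedom, and therefore the p-value, differ slightly.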


#### Pearson Correlation Coefficient (r)

The **Pearson Correlation Coefficient (r)** is a measure of the linear relationship between two continuous variables. It quantifies the strength and direction of the relationship between the variables, making it a widely used statistic in data analysis.

The Pearson correlation coefficient between two variables, X and Y, is calculated using the following formula:
![pearson_correlation_formula.png](attachment:pearson_correlation_formula.png)


\[
r = \frac{n(\sum{XY}) - (\sum{X})(\sum{Y})}{\sqrt{[n\sum{X^2} - (\sum{X})^2][n\sum{Y^2} - (\sum{Y})^2]}}
\]

Where:
- \( n \) = Number of data points
- \( \sum{XY} \) = Sum of the product of paired scores (X and Y)
- \( \sum{X} \) = Sum of the X scores
- \( \sum{Y} \) = Sum of the Y scores
- \( \sum{X^2} \) = Sum of the squares of the X scores
- \( \sum{Y^2} \) = Sum of the squares of the Y scores
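
The formula above translates almost term by term into NumPy. The sketch below implements it in a helper called `pearson_r` (a name introduced here for illustration) and checks the result against `scipy.stats.pearsonr` on a small made-up dataset.

```python
import numpy as np
from scipy import stats

def pearson_r(x, y):
    """Pearson correlation computed directly from the sums in the formula."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    numerator = n * np.sum(x * y) - np.sum(x) * np.sum(y)
    denominator = np.sqrt(
        (n * np.sum(x ** 2) - np.sum(x) ** 2)
        * (n * np.sum(y ** 2) - np.sum(y) ** 2)
    )
    return numerator / denominator

# Small illustrative dataset (hypothetical values)
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])

r_manual = pearson_r(x, y)
r_scipy, p_value = stats.pearsonr(x, y)

print("Manual r:", r_manual)
print("SciPy r: ", r_scipy)
```

For this data the sums are n = 5, ΣXY = 66, ΣX = 15, ΣY = 20, ΣX² = 55 and ΣY² = 86, so r = 30 / √1500 ≈ 0.775.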

#### Interpretation of r

- **Range:** The Pearson correlation coefficient ranges from -1 to 1.
  - **r = 1:** Perfect positive linear correlation.
  - **r = 0:** No linear correlation.
  - **r = -1:** Perfect negative linear correlation.

- **Direction:**
  - **Positive r:** Indicates a positive relationship between X and Y (as X increases, Y also increases).
  - **Negative r:** Indicates a negative relationship between X and Y (as X increases, Y decreases).

- **Strength:**
  - **|r| ≥ 0.7:** Strong correlation.
  - **0.3 ≤ |r| < 0.7:** Moderate correlation.
  - **|r| < 0.3:** Weak correlation.
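
A short sketch with made-up data illustrates the extreme cases. Note the last one: r measures only linear association, so a perfectly deterministic but symmetric relationship such as y = x² can still give r = 0.

```python
import numpy as np
from scipy import stats

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])

# Perfect positive linear relationship
r_pos, _ = stats.pearsonr(x, 2 * x + 1)

# Perfect negative linear relationship
r_neg, _ = stats.pearsonr(x, -3 * x)

# Deterministic but non-linear relationship: no linear correlation
r_quad, _ = stats.pearsonr(x, x ** 2)

print("y = 2x + 1: r =", r_pos)
print("y = -3x:    r =", r_neg)
print("y = x^2:    r =", r_quad)
```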

#### Significance Testing

The significance of the Pearson correlation coefficient can be tested using a hypothesis test on the population correlation ρ, which the sample value r estimates:

- **Null Hypothesis (H₀):** There is no linear correlation between the variables (ρ = 0).
- **Alternative Hypothesis (H₁):** There is a linear correlation between the variables (ρ ≠ 0).

The **p-value** resulting from the test is the probability of observing a sample correlation at least as extreme as r if the null hypothesis is true:
- **p-value < 0.05:** Reject H₀; conclude that there is a significant correlation.
- **p-value ≥ 0.05:** Fail to reject H₀; the data do not provide significant evidence of a linear correlation.
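
`scipy.stats.pearsonr` returns both r and the two-sided p-value for this test. As a sketch, the example below runs it on simulated data (the variables, seed, and noise level are assumptions) and also reproduces the p-value from the equivalent t-statistic, t = r·√((n − 2) / (1 − r²)) with n − 2 degrees of freedom.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only

# Hypothetical paired measurements: y depends weakly on x plus noise
n = 30
x = rng.normal(loc=25, scale=3, size=n)
y = 0.3 * x + rng.normal(loc=0, scale=3, size=n)

r, p_value = stats.pearsonr(x, y)

# Equivalent t-test for H0: rho = 0
t_stat = r * np.sqrt((n - 2) / (1 - r ** 2))
p_manual = 2 * stats.t.sf(abs(t_stat), n - 2)

print("r =", r)
print("p-value (pearsonr):", p_value)
print("p-value (t-test):  ", p_manual)

if p_value < 0.05:
    print("Result: Reject H0; the correlation is statistically significant.")
else:
    print("Result: Fail to reject H0; no significant evidence of a linear correlation.")
```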