# Statistics Project


In this project, we will apply various statistical concepts and techniques to analyze a real-world dataset. The project will cover:

1. **Data Exploration**
2. **Descriptive Statistics**
3. **Data Visualization**
4. **Hypothesis Testing**
5. **Regression Analysis**
6. **Conclusion**

Let's dive into the world of data and uncover interesting insights using statistical methods!


## 1. Data Exploration


We will use the **Iris** dataset for this project. The dataset contains 150 measurements of iris flowers, 50 from each of three species (*setosa*, *versicolor*, and *virginica*), with lengths and widths recorded in centimetres.

**Attributes:**

- `sepal_length`
- `sepal_width`
- `petal_length`
- `petal_width`
- `species`

**Tasks:**

- Load the dataset.
- Display the first few rows.
- Check for missing values.


In [None]:

# Import necessary libraries
import pandas as pd

# Load the Iris dataset
df = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')

# Display the first five rows
df.head()


In [None]:

# Check for missing values
df.isnull().sum()


## 2. Descriptive Statistics


Calculate the following descriptive statistics for each numerical attribute:

- Mean
- Median
- Variance
- Standard Deviation

**Tasks:**

- Use built-in pandas functions to compute the statistics.
- Interpret the results.


In [None]:

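# Calculate mean, median, variance, and standard deviation for the numeric columns
# (df.describe() reports the mean, std, and quartiles but not the variance)
descriptive_stats = df.select_dtypes(include='number').agg(['mean', 'median', 'var', 'std'])
descriptive_stats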
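
To make the four statistics concrete, here is a minimal sketch that computes them with Python's standard `statistics` module. The sample values are hypothetical and are not taken from the Iris data. Like the pandas `var` and `std` methods with their default settings, `statistics.variance` and `statistics.stdev` use the sample formulas, which divide by `n - 1`.

```python
import statistics

# Hypothetical measurements in cm (illustrative only, not Iris values)
sample = [1.4, 1.3, 1.5, 1.7, 1.6]

mean = statistics.mean(sample)          # arithmetic average
median = statistics.median(sample)      # middle value of the sorted sample
variance = statistics.variance(sample)  # sample variance, divides by n - 1
std_dev = statistics.stdev(sample)      # square root of the sample variance

print(f"mean={mean:.3f} median={median:.3f} var={variance:.4f} std={std_dev:.4f}")
```

Because the standard deviation is the square root of the variance, it is expressed in the same units as the data, which usually makes it the easier of the two to interpret.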


## 3. Data Visualization


Visualize the data to identify patterns and relationships.

**Tasks:**

- Create histograms for each numerical attribute.
- Generate scatter plots to observe relationships between attributes.


In [None]:

import matplotlib.pyplot as plt
import seaborn as sns

# Histograms
df.hist(figsize=(10, 8))
plt.show()


In [None]:

# Pairplot using seaborn
sns.pairplot(df, hue='species')
plt.show()


## 4. Hypothesis Testing


Test whether there is a significant difference in the mean petal length between two species: *setosa* and *versicolor*.

**Tasks:**

- State the null and alternative hypotheses.
- Perform an independent two-sample t-test.
- Interpret the results.


In [None]:

from scipy import stats

# Separate the data by species
setosa_petal_length = df.loc[df['species'] == 'setosa', 'petal_length']
versicolor_petal_length = df.loc[df['species'] == 'versicolor', 'petal_length']

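# Perform an independent two-sample t-test
# (SciPy assumes equal variances by default; pass equal_var=False for Welch's test)
t_statistic, p_value = stats.ttest_ind(setosa_petal_length, versicolor_petal_length)

print(f"T-statistic: {t_statistic:.2f}")
# Scientific notation keeps a very small p-value from printing as 0.0000
print(f"P-value: {p_value:.3e}")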



**Interpretation:**

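- The null hypothesis is that *setosa* and *versicolor* have the same mean petal length; the alternative is that the means differ.
- If the p-value is less than the chosen significance level (0.05 here), we reject the null hypothesis. Otherwise we fail to reject it, which is not the same as showing that the means are equal.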
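
The standard two-sample t-test assumes that both groups have the same variance. When that assumption is doubtful, Welch's t-test is a common alternative; in SciPy it is available through `stats.ttest_ind(..., equal_var=False)`. The sketch below shows how Welch's statistic and its approximate degrees of freedom are computed. The helper name `welch_t` and both samples are hypothetical, chosen only for illustration.

```python
import math
import statistics

def welch_t(a, b):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    # Squared standard error of each group mean
    va = statistics.variance(a) / len(a)
    vb = statistics.variance(b) / len(b)
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va + vb)
    dof = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, dof

# Hypothetical samples (illustrative only, not Iris values)
group_a = [1.4, 1.3, 1.5, 1.7, 1.6]
group_b = [4.7, 4.5, 4.9, 4.0, 4.6]

t_stat, dof = welch_t(group_a, group_b)
print(f"t = {t_stat:.2f}, degrees of freedom = {dof:.2f}")
```

A large absolute t value relative to the degrees of freedom corresponds to a small p-value, which is what the SciPy call reports directly.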


## 5. Regression Analysis


Perform a linear regression to predict **petal length** based on **petal width**.

**Tasks:**

- Split the data into training and testing sets.
- Fit a linear regression model.
- Evaluate the model's performance.


In [None]:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Prepare the data
X = df[['petal_width']]
y = df['petal_length']

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)
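
For a single predictor, ordinary least squares has a closed-form solution, which is what a linear regression fit amounts to in this case. The sketch below is a plain-Python illustration and not how scikit-learn is implemented internally; `fit_line` and the data points are hypothetical. With the fitted scikit-learn model, the corresponding values are available as `model.coef_` and `model.intercept_`.

```python
def fit_line(x, y):
    """Least-squares slope and intercept for one predictor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope is the covariance of x and y divided by the variance of x
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical points lying exactly on y = 2x + 1 (illustrative only)
x_demo = [0.2, 0.4, 1.0, 1.5, 2.0]
y_demo = [1.4, 1.8, 3.0, 4.0, 5.0]

slope, intercept = fit_line(x_demo, y_demo)
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
```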


In [None]:

# Evaluate the model
r2 = r2_score(y_test, y_pred)
print(f"R-squared: {r2:.2f}")



**Interpretation:**

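- R-squared is the proportion of the variance in petal length, here measured on the test set, that the model explains. A value closer to 1 indicates a better fit; a value at or below 0 means the model does no better than always predicting the mean.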
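
As a rough illustration of what `r2_score` measures, R-squared can be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The helper `r_squared` and the values below are hypothetical and serve only to show the arithmetic.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # unexplained variation
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total variation
    return 1 - ss_res / ss_tot

# Hypothetical targets and predictions (illustrative only)
y_true_demo = [1.0, 2.0, 3.0, 4.0]
y_pred_demo = [1.1, 1.9, 3.2, 3.8]

r2_demo = r_squared(y_true_demo, y_pred_demo)
print(f"R-squared: {r2_demo:.2f}")
```

Predicting the mean of the targets for every observation gives an R-squared of 0, and predictions worse than that give a negative value.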


## 6. Conclusion


Summarize the findings from each section and reflect on the statistical techniques applied.

**Points to Address:**

- Key insights from descriptive statistics and visualizations.
- Results of the hypothesis test and what they imply.
- Performance of the regression model and its applicability.


*(Write your conclusion here.)*