# Lesson 1: Introduction to Dataset and Basic Statistics

## Objective
Familiarise yourself with the heart attack dataset by exploring its structure and running an initial statistical analysis to gain basic insights into the data.

## Skills Covered
- Importing data with pandas
- Basic data exploration
- Descriptive statistics

---

## Lesson Steps

### Step 1: Importing the Data
Start by importing the necessary library and reading the CSV file into a pandas DataFrame.

In [None]:
import pandas as pd

# Load the dataset
url = 'https://raw.githubusercontent.com/SleeplessOrphan/colabfiles/main/heart.csv'
df = pd.read_csv(url)

# Display the first few rows of the dataframe
print(df.head())

### Step 2: Understanding the Dataset
Use pandas methods to get an understanding of the data's structure.

In [None]:
# Show the structure of the dataset
# (df.info() prints its report directly and returns None, so it is not wrapped in print)
df.info()

# Display summary statistics
print(df.describe())
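As a complement to `info()` and `describe()`, the sketch below shows three quick structure checks: the shape, the number of missing values per column, and the inferred dtypes. It uses a small made-up DataFrame named `sample` (an assumption for illustration, not rows from `heart.csv`) so that it runs without a download; in the lesson you would apply the same calls to `df`.

```python
import pandas as pd

# Made-up stand-in rows; the real lesson uses the DataFrame loaded from heart.csv.
sample = pd.DataFrame({
    "age": [63, 37, 41, 56, 57],
    "trtbps": [145, 130, 130, 120, None],  # one missing reading, added on purpose
    "chol": [233, 250, 204, 236, 354],
})

n_rows, n_cols = sample.shape             # (number of rows, number of columns)
missing_per_column = sample.isna().sum()  # count of missing values in each column
column_types = sample.dtypes              # dtype pandas inferred for each column

print(f"{n_rows} rows x {n_cols} columns")
print(missing_per_column)
print(column_types)
```

A column that contains a missing value is stored as a float, which is why `trtbps` shows up as `float64` in this made-up example.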

### Step 3: Basic Statistical Measures
Calculate specific statistical measures for particular columns.

In [None]:
# Calculate the mean age of the subjects
mean_age = df['age'].mean()
print(f"The mean age is: {mean_age:.2f} years")

# Determine the median cholesterol level
median_chol = df['chol'].median()
print(f"The median cholesterol level is: {median_chol} mg/dl")

# Explore the distribution of resting blood pressure 'trtbps'
blood_pressure_counts = df['trtbps'].value_counts()
print(f"Counts of unique values in the 'trtbps' column:\n{blood_pressure_counts}")
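Because blood pressure is a numeric measurement, `value_counts()` can return a long list of individual readings. One possible way to summarise it is to group the readings into bands with `pd.cut` first. The sketch below assumes made-up readings and hypothetical band edges (`edges`, `labels`), chosen only to demonstrate the call; they are not clinical categories and not values from `heart.csv`.

```python
import pandas as pd

# Made-up readings for illustration only; not taken from heart.csv.
pressures = pd.Series(
    [94, 110, 118, 120, 120, 128, 130, 138, 140, 145, 160, 180],
    name="trtbps",
)

# Hypothetical band edges in mm Hg, chosen only to demonstrate pd.cut.
edges = [90, 120, 140, 160, 200]
labels = ["90-119", "120-139", "140-159", "160-199"]

# right=False makes each band include its lower edge and exclude its upper edge.
bands = pd.cut(pressures, bins=edges, labels=labels, right=False)

# Count the readings per band and keep the bands in their natural order.
band_counts = bands.value_counts().sort_index()
print(band_counts)
```

In the lesson, the same call could be applied to `df['trtbps']` with band edges that suit the question you are asking.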

### Step 4: Data Visualisation
Create a simple histogram to visualise the age distribution within the dataset.

In [None]:
import matplotlib.pyplot as plt

# Plot a histogram of the 'age' column
df['age'].plot(kind='hist', bins=20, edgecolor='black')
plt.title('Age Distribution of Subjects')
plt.xlabel('Age')
plt.ylabel('Count')
plt.grid(axis='y', alpha=0.75)
plt.show()
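To see what the `bins` argument controls, it can help to look at the numbers behind the bars. The sketch below uses `np.histogram` on a handful of made-up ages (an assumption for illustration, not values from `heart.csv`) and compares a coarse and a fine binning of the same data.

```python
import numpy as np

# Made-up ages for illustration only; not taken from heart.csv.
ages = np.array([29, 34, 41, 44, 45, 52, 54, 57, 58, 60, 63, 77])

# np.histogram returns the count in each bin and the bin edges,
# which is the information a histogram plot draws as bars.
coarse_counts, coarse_edges = np.histogram(ages, bins=4)
fine_counts, fine_edges = np.histogram(ages, bins=8)

print("4 bins:", coarse_counts, coarse_edges)
print("8 bins:", fine_counts, fine_edges)
```

Both binnings account for every value, but fewer bins smooth the shape while more bins reveal gaps, so it is worth trying a few settings before describing a distribution.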

## Wrap-Up
This lesson provided an introduction to the `heart.csv` dataset. You learned how to load the data, inspect its structure, calculate basic statistical measures, and create a simple histogram to visualise the age distribution of the subjects. These foundational skills in data analysis will serve as building blocks for the more complex tasks ahead.

## Reflection Questions
1. What can you infer from the mean and median values calculated?
2. Why might it be important to look at the value counts for blood pressure?
3. How does the histogram of age help us understand the dataset's demographics?

## Independent Tasks for Reinforcement

### Task 1: Measure of Central Tendency for 'trtbps'
Analyse the 'trtbps' (resting blood pressure) data to gain insights into its central tendency.

**Instructions:**
- Calculate the mean, median, and mode for the 'trtbps' column.
- Compare these measures and note any observations or insights you can gain from them.
- Briefly discuss in what scenario each measure might be the most informative.

In [None]:
# Enter your code here


Enter your thoughts below.

### Task 2: Exploring Cholesterol Levels
Dive into the 'chol' (cholesterol) data to understand its distribution.

**Instructions:**
- Create a histogram for the 'chol' column to visualise the distribution of cholesterol levels.
- Adjust the number of bins to better visualise the data.
- Write a short paragraph on what the distribution tells you about the cholesterol levels of the subjects.

In [None]:
# Enter your code here


Enter your thoughts below.

### Task 3: Identifying Potential Outliers in 'oldpeak'
Investigate the 'oldpeak' column for any potential outliers that may impact the analysis.

**Instructions:**
- Create a boxplot for the 'oldpeak' column to visualise the distribution and identify outliers.
- Note any significant observations from the boxplot, such as the interquartile range, any potential outliers, and how spread out the data is.
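As a numeric companion to the boxplot, the sketch below computes the interquartile range and the 1.5 × IQR fences, which is the default rule matplotlib uses to decide which points are drawn as outliers. The values are made up for illustration (they are not taken from `heart.csv`), and the variable names are hypothetical.

```python
import pandas as pd

# Made-up values for illustration only; not taken from heart.csv.
values = pd.Series(
    [0.0, 0.0, 0.2, 0.6, 0.8, 1.0, 1.4, 1.8, 2.6, 6.2],
    name="oldpeak",
)

# First and third quartiles, and the interquartile range between them.
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Points beyond 1.5 * IQR from the quartiles are flagged as potential outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(f"IQR: {iqr:.2f}, fences: [{lower_fence:.2f}, {upper_fence:.2f}]")
print(outliers)
```

A flagged point is only a candidate: whether it is an error or a genuine extreme reading is a judgement that depends on the data.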

In [None]:
# Enter your code here


Enter your thoughts below.

## Independent Study Tasks

### Task 1: Dissecting 'sex' Variable
Analyse the 'sex' variable to understand the gender distribution within the dataset.

**Instructions:**
- Determine the number of males and females in the dataset.
- Calculate the ratio of males to females.
- Reflect on how the gender balance might affect the study's outcomes or insights.

In [None]:
# Enter your code here


### Task 2: Investigating 'thalachh' (Maximum Heart Rate Achieved)
Explore the 'thalachh' data for patterns or notable points.

**Instructions:**
- Compute basic statistical measures such as mean and standard deviation for the 'thalachh' column.
- Plot a histogram to observe the distribution of maximum heart rate achieved among the subjects.
- Interpret the skewness of the distribution and consider what that might indicate about the subjects' heart rates.

In [None]:
# Enter your code here


### Task 3: Relationship Between Age and Blood Pressure
Examine the relationship between age and resting blood pressure ('trtbps').

**Instructions:**
- Create a scatter plot to visualise any potential relationship between age and resting blood pressure.
- Identify whether there appears to be a positive, negative, or no correlation between these two variables.
- Consider what this relationship might imply about the population's cardiovascular health.

In [None]:
# Enter your code here


Example code for the reinforcement tasks is shown below:

```python
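# Example code snippets for the reinforcement tasks.
# They assume `df` and `plt` are already defined by the lesson steps above.

# Example code snippet for Task 1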
mean_trtbps = df['trtbps'].mean()
median_trtbps = df['trtbps'].median()
mode_trtbps = df['trtbps'].mode()[0]  # The mode method returns a Series

print(f"Mean resting blood pressure: {mean_trtbps}")
print(f"Median resting blood pressure: {median_trtbps}")
print(f"Mode of resting blood pressure: {mode_trtbps}")

# Example code snippet for Task 2
df['chol'].plot(kind='hist', bins=30, edgecolor='black')
plt.title('Cholesterol Level Distribution')
plt.xlabel('Cholesterol (mg/dl)')
plt.ylabel('Count')
plt.grid(axis='y', alpha=0.75)
plt.show()

# Example code snippet for Task 3
df['oldpeak'].plot(kind='box')
plt.title('Oldpeak Distribution')
plt.ylabel('Oldpeak')
plt.grid(axis='y', alpha=0.75)
plt.show()
```
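The Independent Study Tasks can be approached in a similar way. The following is a minimal sketch, not a model answer: it runs on a small made-up DataFrame named `sample` (an assumption for illustration, not rows from `heart.csv`) and only shows which pandas calls might be useful. Check the dataset's documentation for which `sex` code denotes which group before interpreting the counts.

```python
import pandas as pd

# Made-up rows for illustration only; the real tasks use df loaded from heart.csv.
sample = pd.DataFrame({
    "sex": [1, 1, 0, 1, 0, 1, 1, 0],
    "age": [63, 37, 41, 56, 57, 44, 52, 67],
    "trtbps": [145, 130, 130, 120, 120, 140, 172, 160],
    "thalachh": [150, 187, 172, 178, 163, 148, 162, 108],
})

# Task 1: counts and shares for each category of 'sex'.
sex_counts = sample["sex"].value_counts()
sex_shares = sample["sex"].value_counts(normalize=True)

# Task 2: mean, standard deviation, and skewness of 'thalachh'.
thalachh_mean = sample["thalachh"].mean()
thalachh_std = sample["thalachh"].std()
thalachh_skew = sample["thalachh"].skew()  # negative suggests a longer left tail

# Task 3: Pearson correlation between age and resting blood pressure.
age_bp_corr = sample["age"].corr(sample["trtbps"])

print(sex_counts)
print(sex_shares)
print(f"thalachh mean={thalachh_mean:.1f}, std={thalachh_std:.1f}, skew={thalachh_skew:.2f}")
print(f"age vs trtbps correlation: {age_bp_corr:.2f}")
```

The correlation coefficient summarises only the linear part of a relationship, so it is best read alongside the scatter plot rather than instead of it.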

