# P&S Module 1 - Lesson 2: Scatter Plots

**Week 1 - Lesson 2 - Video 8**

In this part, we will draw scatter plots for pairs of quantitative (numerical) variables.

---

## Scatter Plot

A **scatter plot** is used to display the relationship between two numerical variables. Each point on the graph represents one observation.

**What to look for in a scatter plot:**
- **Direction**: Positive (upward trend) or Negative (downward trend)
- **Strength**: How closely do points follow a pattern?
- **Form**: Linear (straight line) or Non-linear (curved)

**Common relationships:**
- Study hours vs Exam scores (positive)
- Price vs Demand (negative)
- Height vs Weight (positive)
- Age vs Reaction time (non-linear)

## Problem Statement

Draw a scatter diagram for the given data:

**x = [10, 15, 20, 25, 30, 35, 40]**

**y = [8, 18, 10, 25, 22, 30, 28]**

In [None]:
import matplotlib.pyplot as plt

# Example data
x = [10, 15, 20, 25, 30, 35, 40]
y = [8, 18, 10, 25, 22, 30, 28]

# Create scatter plot
plt.figure(figsize=(7, 5))
plt.scatter(x, y, color='blue', marker='o')

# Add labels and title
plt.title('Scatter Plot Example')
plt.xlabel('X-axis (e.g., Study Hours)')
plt.ylabel('Y-axis (e.g., Exam Scores)')

# Show grid and plot
plt.grid(True)
plt.show()

### Understanding the Plot

Each point (x, y) represents one observation:
- Point (10, 8): When x=10, y=8
- Point (15, 18): When x=15, y=18
- And so on...

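Looking at the overall pattern tells us whether x and y are related. Here the points drift upward from left to right, which suggests a positive relationship, although they do not fall exactly on a straight line.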
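
To make the pairing concrete, the short sketch below lists every observation as an (x, y) pair, using the data from the problem statement. The name `points` is introduced here for illustration only.

```python
# Data from the problem statement
x = [10, 15, 20, 25, 30, 35, 40]
y = [8, 18, 10, 25, 22, 30, 28]

# zip pairs the i-th x value with the i-th y value: one pair per observation
points = list(zip(x, y))  # 'points' is an illustrative name

for xi, yi in points:
    print(f"Point ({xi}, {yi}): when x={xi}, y={yi}")
```

Each printed line corresponds to exactly one marker on the scatter plot.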

### 📝 TO DO #1: Change the Data

Try creating scatter plots with different relationships. Uncomment one dataset at a time:

In [None]:
# TO DO: Uncomment one dataset at a time and run the cell

# 1. Strong positive relationship (as one increases, other increases)
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2, 4, 6, 8, 10, 12, 14, 16, 18, 20]

# 2. Strong negative relationship (as one increases, other decreases)
# x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
# y = [20, 18, 16, 14, 12, 10, 8, 6, 4, 2]

# 3. No relationship (random scatter)
# x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
# y = [5, 12, 3, 18, 7, 9, 15, 2, 11, 6]

# 4. Non-linear relationship (quadratic)
# x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
# y = [1, 4, 9, 16, 25, 36, 49, 64, 81, 100]  # y = x²

plt.figure(figsize=(7, 5))
plt.scatter(x, y, color='red', marker='o', s=50)
plt.title('Scatter Plot - Different Relationships')
plt.xlabel('X values')
plt.ylabel('Y values')
plt.grid(True)
plt.show()

print("What type of relationship do you see?")
print("Is it positive, negative, or no relationship?")
print("Is it linear or curved?")

### Adding a Trend Line

A **trend line** (or line of best fit) helps us see the general pattern in the data. It's especially useful for linear relationships.
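
For a straight line, "best fit" usually means least squares: the line that makes the sum of squared vertical distances to the points as small as possible. The sketch below works through that calculation by hand for the problem-statement data. The variable names (`s_xx`, `s_xy`, `slope`, `intercept`) are illustrative and not part of the lesson.

```python
x = [10, 15, 20, 25, 30, 35, 40]
y = [8, 18, 10, 25, 22, 30, 28]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Sum of squared deviations of x, and sum of cross-products of deviations
s_xx = sum((xi - mean_x) ** 2 for xi in x)
s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

slope = s_xy / s_xx                  # change in y per unit change in x
intercept = mean_y - slope * mean_x  # the line passes through (mean_x, mean_y)

print(f"y = {slope:.2f}x + {intercept:.2f}")  # prints: y = 0.69x + 3.00
```

A degree-1 fit with `np.polyfit(x, y, 1)` is also a least-squares fit, so it should return the same two coefficients up to rounding.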

In [None]:
import numpy as np

# Original data
x = [10, 15, 20, 25, 30, 35, 40]
y = [8, 18, 10, 25, 22, 30, 28]

# Calculate trend line
z = np.polyfit(x, y, 1)  # 1 means linear (degree 1 polynomial)
p = np.poly1d(z)

# Create the plot
plt.figure(figsize=(8, 6))
plt.scatter(x, y, color='blue', marker='o', s=100, label='Data points')
plt.plot(x, p(x), 'r--', label=f'Trend line: y = {z[0]:.2f}x + {z[1]:.2f}')

plt.title('Scatter Plot with Trend Line')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.legend()
plt.grid(True)
plt.show()

print(f"Trend line equation: y = {z[0]:.2f}x + {z[1]:.2f}")
print(f"This means: For every 1 unit increase in x, y increases by about {z[0]:.2f} units on average")

### 📝 TO DO #2: Customize the Scatter Plot

Change different properties of the scatter plot to see their effects:

In [None]:
x = [10, 15, 20, 25, 30, 35, 40]
y = [8, 18, 10, 25, 22, 30, 28]

plt.figure(figsize=(8, 6))

# TO DO: Try changing these parameters:
# marker: 'o', 's', '^', 'D', '*', '+', 'x'
# color: 'blue', 'red', 'green', 'purple', 'orange'
# s (size): 50, 100, 200, 300

plt.scatter(x, y,
           color='green',    # TO DO: Change color
           marker='s',       # TO DO: Change marker shape ('s' = square)
           s=150,           # TO DO: Change size
           alpha=0.6,       # TO DO: Change transparency (0=transparent, 1=solid)
           edgecolor='black', # TO DO: Change edge color
           linewidth=2)     # TO DO: Change edge width

plt.title('Customized Scatter Plot')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.grid(True, alpha=0.3)
plt.show()

### Correlation Coefficient

The **correlation coefficient (r)** measures the strength and direction of a linear relationship:
- **r = +1**: Perfect positive correlation
- **r = 0**: No linear correlation
- **r = -1**: Perfect negative correlation
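
One common way to compute r (the Pearson correlation coefficient) is to divide the sum of cross-products of the deviations from the means by the square root of the product of the two sums of squared deviations. The sketch below does this for the problem-statement data. The helper name `pearson_r` is hypothetical, written only for this illustration.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Cross-products and squared deviations from the means
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    s_yy = sum((yi - mean_y) ** 2 for yi in y)
    return s_xy / math.sqrt(s_xx * s_yy)

x = [10, 15, 20, 25, 30, 35, 40]
y = [8, 18, 10, 25, 22, 30, 28]

r = pearson_r(x, y)
print(f"r = {r:.3f}")  # prints: r = 0.864
```

Because the numerator can be negative while the denominator is always positive, the sign of r gives the direction of the relationship and its size gives the strength.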

**Interpretation (a common rule of thumb; exact cutoffs vary by field):**
- |r| > 0.7: Strong relationship
- 0.4 < |r| ≤ 0.7: Moderate relationship
- |r| ≤ 0.4: Weak relationship

In [None]:
from scipy import stats

# Calculate correlation coefficient
x = np.array([10, 15, 20, 25, 30, 35, 40])
y = np.array([8, 18, 10, 25, 22, 30, 28])

correlation, p_value = stats.pearsonr(x, y)

print(f"Correlation coefficient (r): {correlation:.3f}")

# Interpret the correlation
if abs(correlation) > 0.7:
    strength = "Strong"
elif abs(correlation) > 0.4:
    strength = "Moderate"
else:
    strength = "Weak"

if correlation > 0:
    direction = "positive"
elif correlation < 0:
    direction = "negative"
else:
    direction = "no"

print(f"Interpretation: {strength} {direction} relationship")
print("\nNote: Correlation does NOT imply causation!")

### 📝 TO DO #3: Real-World Example

Create a scatter plot for a real-world scenario. Choose one example below or create your own:

In [None]:
# TO DO: Choose one example or create your own

# Example 1: Study Hours vs Test Scores
x_data = [1, 2, 3, 4, 5, 6, 7, 8]          # study hours
y_data = [55, 60, 65, 70, 75, 85, 88, 92]  # test scores
x_label = "Study Hours per Week"
y_label = "Test Score (%)"
title = "Study Time vs Test Performance"

# Example 2: Temperature vs Ice Cream Sales
# x_data = [15, 18, 20, 22, 25, 28, 30, 32]  # temperature
# y_data = [20, 25, 30, 35, 45, 60, 75, 85]  # ice cream sales
# x_label = "Temperature (°C)"
# y_label = "Ice Cream Sales (units)"
# title = "Temperature vs Ice Cream Sales"

# Example 3: Your own data
# x_data = [...]  # TO DO: Add your x values
# y_data = [...]  # TO DO: Add your y values
# x_label = "Your X Variable"
# y_label = "Your Y Variable"
# title = "Your Title"

# Create the scatter plot (uses whichever example is active above)
plt.figure(figsize=(8, 6))
plt.scatter(x_data, y_data, color='purple', s=100)

# Add trend line
z = np.polyfit(x_data, y_data, 1)
p = np.poly1d(z)
plt.plot(x_data, p(x_data), 'r--', alpha=0.8)

# Labels and title
plt.title(title)
plt.xlabel(x_label)
plt.ylabel(y_label)
plt.grid(True, alpha=0.3)
plt.show()

# Calculate and show correlation
r, _ = stats.pearsonr(x_data, y_data)
print(f"Correlation: r = {r:.3f}")
print(f"Trend line: {y_label} = {z[0]:.2f} × {x_label} + {z[1]:.2f}")

### 📝 TO DO #4: Multiple Groups in One Plot

Sometimes we want to compare different groups on the same scatter plot:

In [None]:
# Data for two different classes
# Class A - Regular study pattern
hours_A = [1, 2, 3, 4, 5, 6, 7, 8]
scores_A = [50, 55, 60, 65, 70, 75, 80, 85]

# Class B - Irregular study pattern
hours_B = [1, 2, 3, 4, 5, 6, 7, 8]
scores_B = [45, 52, 58, 68, 72, 78, 82, 88]

# TO DO: Try adding a third class
# hours_C = [1, 2, 3, 4, 5, 6, 7, 8]
# scores_C = [40, 48, 56, 64, 72, 80, 88, 96]  # Steeper improvement

plt.figure(figsize=(10, 6))

# Plot each class with different colors
plt.scatter(hours_A, scores_A, color='blue', label='Class A', s=100, alpha=0.7)
plt.scatter(hours_B, scores_B, color='red', label='Class B', s=100, alpha=0.7)
# plt.scatter(hours_C, scores_C, color='green', label='Class C', s=100, alpha=0.7)

plt.title('Study Hours vs Test Scores - Multiple Classes')
plt.xlabel('Study Hours per Week')
plt.ylabel('Test Score (%)')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

# Compare correlations
r_A, _ = stats.pearsonr(hours_A, scores_A)
r_B, _ = stats.pearsonr(hours_B, scores_B)
# r_C, _ = stats.pearsonr(hours_C, scores_C)  # TO DO: Uncomment with Class C

print(f"Class A correlation: r = {r_A:.3f}")
print(f"Class B correlation: r = {r_B:.3f}")
# print(f"Class C correlation: r = {r_C:.3f}")

## Summary

### Key Concepts:

1. **Scatter plots** show relationships between two numerical variables
2. **Each point** represents one observation (x, y pair)
3. **Patterns** can be positive, negative, or absent (no relationship), and linear or non-linear
4. **Trend lines** help visualize the general pattern
5. **Correlation coefficient (r)** measures the strength and direction of a linear relationship

### Important to Remember:

- **Correlation ≠ Causation**: Just because two things are related doesn't mean one causes the other
- **Outliers** can strongly affect correlation
- **Non-linear relationships** might have low correlation even if there's a strong pattern
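
The last two cautions can be checked numerically. The sketch below uses made-up data (not from the lesson) and `np.corrcoef`, which returns a correlation matrix whose off-diagonal entry is r. The variable names are illustrative.

```python
import numpy as np

# Illustrative data, made up for this sketch

# 1. A curved pattern: y is completely determined by x, yet r is about 0
x_curve = np.arange(-5, 6)      # -5, -4, ..., 5
y_curve = x_curve ** 2          # symmetric parabola
r_curve = np.corrcoef(x_curve, y_curve)[0, 1]

# 2. A perfect line, before and after one point is replaced by an outlier
x_line = np.arange(1, 11)       # 1, 2, ..., 10
y_line = 2 * x_line
r_clean = np.corrcoef(x_line, y_line)[0, 1]

y_outlier = y_line.copy()
y_outlier[-1] = 0               # hypothetical outlier: last y drops from 20 to 0
r_outlier = np.corrcoef(x_line, y_outlier)[0, 1]

print(f"Parabola:         r = {r_curve:.3f}")
print(f"Clean line:       r = {r_clean:.3f}")
print(f"Line + 1 outlier: r = {r_outlier:.3f}")
```

For the parabola, r comes out at essentially zero even though the pattern is exact, and a single outlier pulls r for the straight line from 1 down to about 0.45. This is why it helps to look at the scatter plot before trusting r on its own.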

### Practice Questions:

1. If height and weight have r = 0.8, what does this mean?
2. Can you have a strong relationship with r = 0?
3. What would a scatter plot look like if r = -1?
4. Give an example of correlation without causation.

### Extra: Using Plotly for Interactive Plots

Plotly creates interactive plots where you can hover over points to see their values:

In [None]:
# Optional: Interactive plot with Plotly
# Uncomment to use if Plotly is installed (trendline='ols' also requires statsmodels)

# import plotly.express as px
# import pandas as pd

# df = pd.DataFrame({
#     'Study Hours': [1, 2, 3, 4, 5, 6, 7, 8],
#     'Test Score': [55, 60, 65, 70, 75, 85, 88, 92]
# })

# fig = px.scatter(df, x='Study Hours', y='Test Score',
#                  title='Interactive Scatter Plot',
#                  trendline='ols')  # Add trend line
# fig.show()