# Lab Instructions

Find a dataset that interests you. I'd recommend starting on [Kaggle](https://www.kaggle.com/). Read through all of the material about the dataset and download it as a .csv file.

1. Write a short summary of the data.  Where did it come from?  How was it collected?  What are the features in the data?  Why is this dataset interesting to you?  

2. Identify 5 interesting questions about your data that you can answer using Pandas methods.  

3. Answer those questions! You may use any tool you want (including LLMs) to help you write your code; however, the answers must be computed with Pandas. LLMs will not always use Pandas unless you specifically ask them to.

4. Write the answer to each question in a text box underneath the code you used to calculate it.



Question 1: Do students who participate in extracurricular activities perform better on average?

In [2]:
import pandas as pd
import numpy as np
df = pd.read_csv("C:/Users/MALICIOUS/Downloads/archive/StudentPerformance.csv")

In [3]:
print("Rows, Columns:", df.shape)
print(df.head())
print("\nColumns:", df.columns.tolist())

Rows, Columns: (10000, 6)
   Hours Studied  Previous Scores Extracurricular Activities  Sleep Hours  \
0              7               99                        Yes            9   
1              4               82                         No            4   
2              8               51                        Yes            7   
3              5               52                        Yes            5   
4              7               75                         No            8   

   Sample Question Papers Practiced  Performance Index  
0                                 1               91.0  
1                                 2               65.0  
2                                 2               45.0  
3                                 2               36.0  
4                                 5               66.0  

Columns: ['Hours Studied', 'Previous Scores', 'Extracurricular Activities', 'Sleep Hours', 'Sample Question Papers Practiced', 'Performance Index']


In [4]:
# Question 1: Do students who do extracurriculars perform better on average?
print("\n--- Q1: Average performance by extracurricular activities ---")
q1 = df.groupby("Extracurricular Activities")["Performance Index"].mean().sort_index()
print(q1)
print("\nAnswer:")
if "Yes" in q1.index and "No" in q1.index:
    diff = q1["Yes"] - q1["No"]
    print(f"Students with extracurriculars average {diff:.2f} points {'higher' if diff>0 else 'lower'} than those without.")
else:
    print("Extracurricular categories found:", q1.index.tolist())


--- Q1: Average performance by extracurricular activities ---
Extracurricular Activities
No     54.758511
Yes    55.700889
Name: Performance Index, dtype: float64

Answer:
Students with extracurriculars average 0.94 points higher than those without.
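
A 0.94-point gap is easier to judge against the baseline it sits on. The sketch below is a minimal illustration, not part of the original analysis: it re-enters the two group means printed above as a Pandas Series (the name `group_means` is made up for this example) and expresses the gap as a percentage of the "No" group's mean. It does not reload the CSV.

```python
import pandas as pd

# Group means copied from the Q1 output above (not recomputed from the CSV).
group_means = pd.Series(
    {"No": 54.758511, "Yes": 55.700889},
    name="Performance Index",
)

# Absolute gap between the two groups.
diff = group_means["Yes"] - group_means["No"]

# Gap relative to the "No" group's mean, in percent.
pct = diff / group_means["No"] * 100

print(f"Absolute gap: {diff:.2f} points")
print(f"Relative gap: {pct:.2f}% of the 'No' group mean")
```

On the real DataFrame, `df.groupby("Extracurricular Activities")["Performance Index"].agg(["mean", "std", "count"])` would also show the spread and size of each group, which helps when judging whether a gap this small matters.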


In [5]:
# Question 2: Is there a relationship between hours studied and performance?
print("\n--- Q2: Correlation between hours studied and performance ---")
q2 = df["Hours Studied"].corr(df["Performance Index"])
print("Correlation:", q2)
print("\nAnswer:")
print(f"There is a correlation of {q2:.3f} between Hours Studied and Performance Index (positive means higher study time tends to relate to higher performance).")



--- Q2: Correlation between hours studied and performance ---
Correlation: 0.37373035069872373

Answer:
There is a correlation of 0.374 between Hours Studied and Performance Index (positive means higher study time tends to relate to higher performance).
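
`Series.corr` computes the Pearson correlation by default: the covariance of the two columns divided by the product of their standard deviations. The sketch below checks that identity on the five rows shown by `df.head()` earlier. `sample` is a hypothetical name, and with only five rows the value will not match the 0.374 obtained from the full dataset.

```python
import pandas as pd

# The five rows printed by df.head() above (not the full 10,000-row dataset).
sample = pd.DataFrame(
    {
        "Hours Studied": [7, 4, 8, 5, 7],
        "Performance Index": [91.0, 65.0, 45.0, 36.0, 66.0],
    }
)

x = sample["Hours Studied"]
y = sample["Performance Index"]

# Pearson r by hand: covariance divided by the product of standard deviations.
manual_r = x.cov(y) / (x.std() * y.std())

# Pandas' built-in version (method="pearson" is the default).
builtin_r = x.corr(y)

print(f"manual:   {manual_r:.6f}")
print(f"built-in: {builtin_r:.6f}")
```

Because Pearson correlation measures only linear association, a value of 0.374 describes how closely the points follow a straight line, not how much performance changes per extra hour studied.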


In [10]:
# Q3: How does sleep affect performance?
df["Sleep Category"] = pd.cut(
    df["Sleep Hours"],
    bins=[-0.01, 5, 7, np.inf],
    labels=["Low Sleep (≤5 hrs)", "Moderate Sleep (6–7 hrs)", "High Sleep (8+ hrs)"]
)

q3 = df.groupby("Sleep Category", observed=True)["Performance Index"].mean().round(2)

print("\nAverage Performance by Sleep Category:")
print(q3)

print("\nAnswer:")
best_sleep = q3.idxmax()
best_value = q3.max()
print(f"Students in the '{best_sleep}' group have the highest average performance ({best_value}).")



Average Performance by Sleep Category:
Sleep Category
Low Sleep (≤5 hrs)          54.30
Moderate Sleep (6–7 hrs)    54.97
High Sleep (8+ hrs)         56.35
Name: Performance Index, dtype: float64

Answer:
Students in the 'High Sleep (8+ hrs)' group have the highest average performance (56.35).
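
Because `pd.cut` uses right-inclusive bins by default, the edges `[-0.01, 5, 7, np.inf]` create the intervals (-0.01, 5], (5, 7], and (7, inf]. The sketch below runs the same binning on a few illustrative whole-hour values rather than the dataset itself, to show which category each value lands in. `hours`, `cats`, and `codes` are names made up for the example.

```python
import numpy as np
import pandas as pd

# Illustrative whole-hour values (assumed for this example, not read from the CSV).
hours = pd.Series([4, 5, 6, 7, 8, 9], name="Sleep Hours")

labels = ["Low Sleep (≤5 hrs)", "Moderate Sleep (6–7 hrs)", "High Sleep (8+ hrs)"]

# Same bin edges as the Q3 cell; right=True (the default) makes each
# interval include its upper edge, so 5 is "Low" and 7 is "Moderate".
cats = pd.cut(hours, bins=[-0.01, 5, 7, np.inf], labels=labels)

# Integer code of each assigned category: 0 = Low, 1 = Moderate, 2 = High.
codes = cats.cat.codes.tolist()

for hour, label in zip(hours.tolist(), cats.astype(str)):
    print(hour, "->", label)
```

The labels assume whole hours. A fractional value such as 5.5 would fall in (5, 7] and be labelled "Moderate Sleep (6–7 hrs)", so the label text would need adjusting if the column held non-integer values.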


In [11]:
# Q4: Do previous scores predict performance?
q4 = df["Previous Scores"].corr(df["Performance Index"])

print("\nCorrelation between Previous Scores and Performance:")
print(round(q4, 3))

print("\nAnswer:")
print(
    f"There is a strong positive correlation ({q4:.3f}), "
    "meaning students who scored higher previously tend to perform better."
)



Correlation between Previous Scores and Performance:
0.915

Answer:
There is a strong positive correlation (0.915), meaning students who scored higher previously tend to perform better.
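
To put 0.915 in context, it can help to compare every numeric feature's correlation with `Performance Index` side by side. The sketch below shows one way to do that, again using only the five `df.head()` rows shown earlier (under the hypothetical name `sample`), so the printed numbers illustrate the method and are not results for the full dataset.

```python
import pandas as pd

# The five rows printed by df.head() above (not the full dataset).
sample = pd.DataFrame(
    {
        "Hours Studied": [7, 4, 8, 5, 7],
        "Previous Scores": [99, 82, 51, 52, 75],
        "Extracurricular Activities": ["Yes", "No", "Yes", "Yes", "No"],
        "Sleep Hours": [9, 4, 7, 5, 8],
        "Sample Question Papers Practiced": [1, 2, 2, 2, 5],
        "Performance Index": [91.0, 65.0, 45.0, 36.0, 66.0],
    }
)

# Keep numeric columns only; the Yes/No column cannot be correlated directly.
numeric = sample.select_dtypes(include="number")

# Correlation of each feature with the target, strongest positive first.
ranking = (
    numeric.corr()["Performance Index"]
    .drop("Performance Index")
    .sort_values(ascending=False)
)
print(ranking.round(3))
```

Replacing `sample` with the full `df` would rank the features for the real data.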


In [12]:
# Q5: Does practicing more sample question papers improve performance?
q5 = (
    df.groupby("Sample Question Papers Practiced")["Performance Index"]
      .mean()
      .round(2)
)

print("\nAverage Performance by Number of Practice Papers:")
print(q5)

print("\nAnswer:")
best_papers = q5.idxmax()
best_score = q5.max()

print(
    f"Average performance is lowest at {q5.idxmin()} papers ({q5.min()}) and "
    f"highest at {best_papers} papers ({best_score}), but it does not rise at every step."
)



Average Performance by Number of Practice Papers:
Sample Question Papers Practiced
0    52.95
1    54.61
2    55.26
3    55.26
4    54.15
5    55.45
6    56.15
7    55.78
8    55.45
9    56.88
Name: Performance Index, dtype: float64

Answer:
Average performance is lowest at 0 papers (52.95) and highest at 9 papers (56.88), but it does not rise at every step.
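
To see how steady the upward trend across practice papers is, the step-to-step changes in the group means can be counted. The sketch below is a minimal check that re-enters the ten means printed above as a Pandas Series (`means`, `steps`, `rises`, and `drops` are names made up for this example) and does not reload the CSV.

```python
import pandas as pd

# Group means copied from the Q5 output above, indexed by papers practiced.
means = pd.Series(
    [52.95, 54.61, 55.26, 55.26, 54.15, 55.45, 56.15, 55.78, 55.45, 56.88],
    index=pd.RangeIndex(10, name="Sample Question Papers Practiced"),
    name="Performance Index",
)

# Change in the mean from each paper count to the next.
steps = means.diff().dropna()

rises = int((steps > 0).sum())
drops = int((steps < 0).sum())

print(f"Steps up: {rises}, steps down: {drops}, out of {len(steps)}")
print("Never decreases:", means.is_monotonic_increasing)
```

Five of the nine steps go up, three go down, and one is flat, so the overall direction is upward but the means do not increase with every additional paper.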
