# **Introduction to Panel Data**

### **What is Panel Data?**

- **Definition:** Panel data, or longitudinal data, combines cross-sectional and time-series data. It tracks multiple subjects over time, providing insights into changes and trends.

- **Key Characteristics:**
  - Multiple subjects (e.g., individuals, firms).
  - Observations recorded at multiple time points.
  - Allows analysis of dynamics over time, accounting for both individual differences and temporal effects.


### **1. Time Series Data Example:**
Time series data captures changes in a single entity over multiple time periods. In the context of our educational case study:

- **Example:** The mathematics scores of a single student (Alice) recorded over three academic years (2021, 2022, 2023).

| Year | Student | Score|
|------|---------|------|
| 2021 | Alice | 75 |
| 2022 | Alice | 80 |
| 2023 | Alice | 90 |

### **2. Cross-Sectional Data Example:**
Cross-sectional data captures multiple entities at a single point in time. In the context of our case study:

- **Example:** The mathematics scores of multiple students in a single academic year (2021).

| Year | Student  | Score |
|------|----------|-------|
| 2021 | Alice    | 75    |
| 2021 | Bob      | 70    |
| 2021 | Charlie  | 68    |

### **1. Where is Panel Data Used?**
- **Applications in Various Fields**
  - Economics: GDP growth, employment statistics.
  - Health Studies: Tracking health outcomes, effectiveness of treatments.
  - Finance: Stock performance, company financials.
  - Social Sciences: Behavioral changes, shifts in public opinion.
- **Examples**
  - Education: Monitoring students’ performance over time to evaluate educational interventions.
  - Public Policy: Assessing the impact of legislation on crime rates across various cities over time.

### **2. Setting Up the Environment**
#### Installing Required Libraries

- Install the libraries used in this session (SciPy is needed for the statistical test in the case study):

In [None]:
%pip install pandas matplotlib seaborn scipy

#### Importing Libraries

In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


### **3. Creating a Panel DataFrame**
#### Sample Dataset

- Let's create a more detailed DataFrame with multiple subjects and years.

In [3]:
# Sample data for multiple students over multiple years
data = {
    'Student': ['Alice', 'Alice', 'Alice', 'Bob', 'Bob', 'Bob', 'Charlie', 'Charlie', 'Charlie', 'Alice', 'Bob', 'Charlie'],
    'Year': [2021, 2022, 2023, 2021, 2022, 2023, 2021, 2022, 2023, 2024, 2024, 2024],
    'Score': [85, 90, 95, 80, 85, 88, 78, 82, 85, 92, 84, 89],
    'Attendance': [90, 92, 95, 85, 88, 87, 80, 82, 81, 93, 90, 85]
}

panel_data = pd.DataFrame(data)
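
A common convention for panel data, though not one this session relies on, is to index the frame by entity and time so that each row is identified by a (Student, Year) pair. A minimal sketch on a small subset of the sample data, with `demo`, `indexed`, `alice_rows`, and `rows_2022` as illustrative names:

```python
import pandas as pd

# Small illustrative panel (a subset of the sample data)
demo = pd.DataFrame({
    'Student': ['Alice', 'Alice', 'Bob', 'Bob'],
    'Year': [2021, 2022, 2021, 2022],
    'Score': [85, 90, 80, 85],
})

# Index by entity and time; sorting keeps lookups predictable
indexed = demo.set_index(['Student', 'Year']).sort_index()

# All years for one student
alice_rows = indexed.loc['Alice']

# All students for one year
rows_2022 = indexed.xs(2022, level='Year')

print(indexed)
```

The rest of the session keeps `Student` and `Year` as ordinary columns, which is often easier for grouping and plotting.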


### **4. Exploring the Panel Data**
#### Viewing the Data

- Examine the dataset with various functions.


In [None]:
# Displaying the entire dataset
print(panel_data)

In [None]:
# Display the first few rows
print(panel_data.head())

In [None]:
# Summary statistics for numerical columns
print(panel_data.describe())

In [None]:
# Information about the DataFrame (info() prints its report itself and returns None)
panel_data.info()

### **5. Basic Operations on Panel Data**
#### 5.1. Grouping and Aggregating Data
- Calculate average scores and attendance by student.

In [None]:
# Average scores and attendance per student
average_scores = panel_data.groupby('Student').agg({'Score': 'mean', 'Attendance': 'mean'})
print(average_scores)
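
Grouping by `Year` instead of `Student` summarizes the cross-section in each period. The sketch below uses pandas named aggregation on an illustrative subset of the sample data; the output column names (`mean_score`, `min_score`, `max_score`) are arbitrary choices.

```python
import pandas as pd

demo = pd.DataFrame({
    'Student': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob', 'Charlie'],
    'Year': [2021, 2021, 2021, 2022, 2022, 2022],
    'Score': [85, 80, 78, 90, 85, 82],
})

# Named aggregation: one output column per (input column, function) pair
yearly_summary = demo.groupby('Year').agg(
    mean_score=('Score', 'mean'),
    min_score=('Score', 'min'),
    max_score=('Score', 'max'),
)
print(yearly_summary)
```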


#### 5.2. Filtering Data
- Filter for a specific year or student.

In [None]:
# Filtering data for 2024
data_2024 = panel_data[panel_data['Year'] == 2024]
print(data_2024)

In [None]:
# Filtering data for Alice
alice_data = panel_data[panel_data['Student'] == 'Alice']
print(alice_data)
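
Filters can also be combined. A short sketch, using an illustrative subset of the sample data, that selects on two conditions at once and on several students with `isin()`:

```python
import pandas as pd

demo = pd.DataFrame({
    'Student': ['Alice', 'Alice', 'Bob', 'Bob', 'Charlie', 'Charlie'],
    'Year': [2023, 2024, 2023, 2024, 2023, 2024],
    'Score': [95, 92, 88, 84, 85, 89],
})

# Combine conditions with & (and) or | (or); each condition needs its own parentheses
recent_high = demo[(demo['Year'] == 2024) & (demo['Score'] >= 85)]

# isin() selects several students at once
alice_and_bob = demo[demo['Student'].isin(['Alice', 'Bob'])]

print(recent_high)
print(alice_and_bob)
```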

#### 5.3. Pivoting Data
- Create a pivot table for scores and attendance.

In [None]:
# Pivot table for scores
pivot_scores = panel_data.pivot(index='Year', columns='Student', values='Score')
print(pivot_scores)

In [None]:
# Pivot table for attendance
pivot_attendance = panel_data.pivot(index='Year', columns='Student', values='Attendance')
print(pivot_attendance)
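
Pivoting turns the long table into a wide one; `melt` goes the other way. The round trip below is a sketch on a small illustrative frame (`long_df`, `wide`, and `restored` are made-up names), which can be handy when a tool expects one layout or the other.

```python
import pandas as pd

long_df = pd.DataFrame({
    'Student': ['Alice', 'Alice', 'Bob', 'Bob'],
    'Year': [2021, 2022, 2021, 2022],
    'Score': [85, 90, 80, 85],
})

# Long -> wide: one row per year, one column per student
wide = long_df.pivot(index='Year', columns='Student', values='Score')

# Wide -> long: move the index back to a column, then unpivot the student columns
restored = (
    wide.reset_index()
        .melt(id_vars='Year', var_name='Student', value_name='Score')
        .sort_values(['Student', 'Year'])
        .reset_index(drop=True)
)
print(restored)
```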

#### 5.4. Additional Operations
- Count the number of entries per student and year.

In [None]:
# Counting entries per student
entry_counts = panel_data['Student'].value_counts()
print(entry_counts)

In [None]:
# Counting entries per year
year_counts = panel_data['Year'].value_counts()
print(year_counts)
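
Entry counts are also a quick way to check the shape of a panel. The sketch below, on an illustrative frame in which Bob has no 2023 record, checks for duplicated (Student, Year) pairs and compares the number of periods per student. Equal counts are a necessary sign of a balanced panel, not a full proof that every student covers the same years.

```python
import pandas as pd

# Illustrative panel in which Bob has no 2023 record
demo = pd.DataFrame({
    'Student': ['Alice', 'Alice', 'Alice', 'Bob', 'Bob'],
    'Year': [2021, 2022, 2023, 2021, 2022],
    'Score': [85, 90, 95, 80, 85],
})

# Each (Student, Year) pair should appear at most once
has_duplicates = demo.duplicated(subset=['Student', 'Year']).any()

# Number of distinct years observed for each student
periods_per_student = demo.groupby('Student')['Year'].nunique()

# If the counts differ, the panel is unbalanced
is_balanced = periods_per_student.nunique() == 1

print(periods_per_student)
print(f"Duplicates: {has_duplicates}, balanced: {is_balanced}")
```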

#### 5.5. Handling Missing Data
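- Fill missing values, here with the overall mean score as a simple baseline.

In [None]:
# Introduce a missing value for demonstration.
# Cast to float first so that NaN fits the column's dtype.
panel_data['Score'] = panel_data['Score'].astype(float)
panel_data.loc[4, 'Score'] = float('nan')

# Fill missing values with the mean score; assigning the result back
# avoids the chained inplace call, which newer pandas versions warn about
panel_data['Score'] = panel_data['Score'].fillna(panel_data['Score'].mean())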
print(panel_data)
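
Filling with the overall mean ignores which student a score belongs to. One possible alternative, sketched here on a small illustrative frame, fills each gap with that student's own mean. The column name `Score_filled` is an assumption made for the example.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    'Student': ['Alice', 'Alice', 'Alice', 'Bob', 'Bob', 'Bob'],
    'Year': [2021, 2022, 2023, 2021, 2022, 2023],
    'Score': [85.0, 90.0, 95.0, 80.0, np.nan, 88.0],
})

# Mean of each student's observed scores, aligned row by row with the original frame
student_means = demo.groupby('Student')['Score'].transform('mean')

# Fill only the missing entries, using that student's own mean
demo['Score_filled'] = demo['Score'].fillna(student_means)
print(demo)
```

Whether a per-student mean, interpolation between years, or dropping the row is appropriate depends on why the value is missing.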

#### 5.6. Calculating Growth Rates
- Calculate the growth rate of scores from year to year for each student.

In [None]:
# Sorting data to ensure the correct order
panel_data.sort_values(by=['Student', 'Year'], inplace=True)
# Calculating the growth rate of scores
panel_data['Score Growth'] = panel_data.groupby('Student')['Score'].pct_change() * 100
print(panel_data)
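
To see what `pct_change()` computes within each group, the same growth rate can be built by hand from the previous year's score. A sketch on a subset of the sample data, with `Prev Score` and `Growth (%)` as illustrative column names:

```python
import pandas as pd

demo = pd.DataFrame({
    'Student': ['Alice', 'Alice', 'Alice', 'Bob', 'Bob', 'Bob'],
    'Year': [2021, 2022, 2023, 2021, 2022, 2023],
    'Score': [85, 90, 95, 80, 85, 88],
}).sort_values(['Student', 'Year'])

# Previous year's score within each student (NaN for each student's first year)
demo['Prev Score'] = demo.groupby('Student')['Score'].shift(1)

# Growth rate in percent: (current - previous) / previous * 100
demo['Growth (%)'] = (demo['Score'] - demo['Prev Score']) / demo['Prev Score'] * 100
print(demo)
```

The first year of each student has no previous value, so its growth rate is NaN rather than being carried over from another student.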

### **6. Visualizing Panel Data**
#### Plotting Scores and Attendance Over Time
- Visualize the changes for each student over the years.

In [None]:
# Setting the style for the plots
sns.set(style="whitegrid")

# Plotting scores
plt.figure(figsize=(10, 5))
for student in panel_data['Student'].unique():
    subset = panel_data[panel_data['Student'] == student]
    plt.plot(subset['Year'], subset['Score'], marker='o', label=student)

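plt.title("Students' Scores Over Years")
plt.xlabel('Year')
plt.ylabel('Score')
plt.xticks(sorted(panel_data['Year'].unique()))  # show whole years only
plt.legend()
plt.grid(True)  # a bare plt.grid() toggles the grid, which would switch off the one whitegrid draws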
plt.show()

#### Heatmap of Attendance
- Create a heatmap to visualize attendance over the years.

In [None]:
# Creating a heatmap for attendance
plt.figure(figsize=(8, 4))
sns.heatmap(pivot_attendance, annot=True, cmap='YlGnBu', fmt=".1f")
plt.title('Attendance Heatmap')
plt.xlabel('Student')
plt.ylabel('Year')
plt.show()

## **7. Case Study: Evaluating the Impact of a New Teaching Method on Student Performance**
### **Background**
In an educational setting, a new interactive teaching method was implemented to improve students' academic performance in mathematics. This method involved group discussions, hands-on activities, and the use of technology to enhance learning. The goal of this study is to assess whether this new method had a significant impact on student scores over a period of three years.

### **Objective**
To evaluate the effectiveness of the new teaching method by comparing students' mathematics scores before and after its implementation.

### **Data Collection**
Data was collected from three different classes of students over three academic years (2021, 2022, and 2023). The dataset includes the following variables:

- **Student**: Name of the student.
- **Year**: Academic year (2021, 2022, 2023).
- **Score**: Mathematics score (out of 100).
- **Method**: Indicates whether the student was taught using the new method (1 for yes, 0 for no).

### **Sample Dataset**
Here’s a sample of the dataset you might use:

In [21]:
import pandas as pd

# Sample dataset
data = {
    'Student': ['Alice', 'Alice', 'Alice', 'Bob', 'Bob', 'Bob', 'Charlie', 'Charlie', 'Charlie'],
    'Year': [2021, 2022, 2023, 2021, 2022, 2023, 2021, 2022, 2023],
    'Score': [75, 80, 90, 70, 75, 85, 68, 72, 78],
    'Method': [0, 0, 1, 0, 0, 1, 0, 0, 1]  # 0: Traditional, 1: New Method
}

panel_data = pd.DataFrame(data)

### **Analysis Approach**
**1. Descriptive Statistics:** Start by examining the mean scores for each year and method.

In [None]:
# Calculate mean scores by year and teaching method
mean_scores = panel_data.groupby(['Year', 'Method'])['Score'].mean().reset_index()
print(mean_scores)
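
Because each student appears under both methods, the group means can also be broken down per student. The sketch below rebuilds the case-study data under the illustrative name `case` and computes each student's change from the traditional to the new method.

```python
import pandas as pd

case = pd.DataFrame({
    'Student': ['Alice', 'Alice', 'Alice', 'Bob', 'Bob', 'Bob', 'Charlie', 'Charlie', 'Charlie'],
    'Year': [2021, 2022, 2023] * 3,
    'Score': [75, 80, 90, 70, 75, 85, 68, 72, 78],
    'Method': [0, 0, 1] * 3,  # 0: Traditional, 1: New Method
})

# Mean score per student under each method: rows are students, columns are Method 0 and 1
by_method = case.pivot_table(index='Student', columns='Method', values='Score', aggfunc='mean')

# Within-student change: new-method mean minus traditional mean
by_method['Change'] = by_method[1] - by_method[0]
print(by_method)
```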

**2. Visualizing the Data:** Create plots to visualize changes in scores over the years for both teaching methods.

In [None]:
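import matplotlib.pyplot as plt
import seaborn as sns

# Label the methods up front so the legend is built from the data rather than overridden by hand
plot_data = panel_data.assign(Method=panel_data['Method'].map({0: 'Traditional', 1: 'New Method'}))

plt.figure(figsize=(10, 5))
sns.lineplot(data=plot_data, x='Year', y='Score', hue='Method', marker='o')
plt.title('Student Scores by Teaching Method Over Years')
plt.xlabel('Year')
plt.ylabel('Average Score')
plt.xticks(sorted(plot_data['Year'].unique()))  # show whole years only
plt.grid(True)
plt.show()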

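**3. Statistical Testing:**
Conduct a statistical test (e.g., a t-test) to check whether the difference in scores between the two methods is statistically significant. An independent-samples t-test treats every observation as unrelated, whereas here the same students appear in both groups and all new-method scores come from 2023, so the result is only a rough first check.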

In [None]:
from scipy import stats

traditional_scores = panel_data[panel_data['Method'] == 0]['Score']
new_method_scores = panel_data[panel_data['Method'] == 1]['Score']

t_stat, p_value = stats.ttest_ind(traditional_scores, new_method_scores)
print(f"T-statistic: {t_stat}, P-value: {p_value}")
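
Because the same students are measured under both methods, a paired comparison is one alternative to the independent-samples test. The sketch below is an illustration rather than the study's prescribed analysis. It averages each student's traditional-method scores, pairs that average with the student's new-method score, and applies `scipy.stats.ttest_rel`. The names `case`, `before`, and `after` are illustrative.

```python
import pandas as pd
from scipy import stats

case = pd.DataFrame({
    'Student': ['Alice', 'Alice', 'Alice', 'Bob', 'Bob', 'Bob', 'Charlie', 'Charlie', 'Charlie'],
    'Year': [2021, 2022, 2023] * 3,
    'Score': [75, 80, 90, 70, 75, 85, 68, 72, 78],
    'Method': [0, 0, 1] * 3,  # 0: Traditional, 1: New Method
})

# One value per student under each method
before = case[case['Method'] == 0].groupby('Student')['Score'].mean()
after = case[case['Method'] == 1].groupby('Student')['Score'].mean()

# Align on Student so that each pair belongs to the same person
after = after.reindex(before.index)

# Paired t-test on the within-student differences
t_stat_paired, p_value_paired = stats.ttest_rel(after, before)
print(f"Paired t-statistic: {t_stat_paired:.3f}, p-value: {p_value_paired:.4f}")
```

With only three students the test has very little power, and because every student switched methods in the same year, the method effect cannot be separated from a general year-to-year trend.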


### **8. Conclusion on Using Panel Data**
Panel data provides a powerful tool for analyzing how variables change over time across different subjects or entities. Unlike cross-sectional data, which captures a single snapshot in time, or time-series data, which follows one entity over time, panel data combines both, allowing for a more nuanced analysis. By tracking individuals, companies, or other entities across multiple periods, researchers and analysts can better understand trends, behaviors, and potential causal relationships. Because each entity is observed repeatedly, panel data also helps control for unobserved characteristics that differ between entities but stay constant over time, and the larger number of observations generally supports more informative and efficient inference.

#### **Benefits of Panel Data:**
1. **Improved Accuracy:** By tracking multiple observations for each entity, panel data increases the ability to detect and measure dynamics that might not be visible in simple cross-sectional data.
2. **Causal Inferences:** Panel data allows researchers to better identify cause-and-effect relationships because they can observe changes over time.
3. **Handling of Unobserved Heterogeneity:** Panel data controls for individual differences that could otherwise bias results.
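
To make the third benefit concrete: one standard way to remove fixed differences between entities is to measure each observation relative to that entity's own average (the "within" transformation that underlies fixed-effects models). This session does not fit such a model, so the sketch below only shows the transformation, on two students from the case study. The column name `Score_within` is an assumption.

```python
import pandas as pd

demo = pd.DataFrame({
    'Student': ['Alice', 'Alice', 'Alice', 'Bob', 'Bob', 'Bob'],
    'Year': [2021, 2022, 2023] * 2,
    'Score': [75, 80, 90, 70, 75, 85],
})

# Each student's own average score, repeated on every row for that student
student_mean = demo.groupby('Student')['Score'].transform('mean')

# Deviation from the student's own average removes fixed level differences between students
demo['Score_within'] = demo['Score'] - student_mean
print(demo)
```

Alice scores five points above Bob every year, yet their demeaned scores are identical: the constant gap between them drops out and only the change over time remains.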

### **Summary of Session:**
In today's session, we explored the basics of panel data analysis using Python, focusing on both the theoretical and practical aspects. We learned how to handle panel data using the pandas library and applied various basic operations like descriptive statistics, data transformation, and visualization. We also examined a case study to evaluate the impact of a teaching method on student performance, providing real-world context to panel data's application.

#### **Key Takeaways:**

- Panel data combines time-series and cross-sectional data, allowing for richer analysis.
- Operations like grouping, calculating statistics, and visualizing data trends are easy to implement with pandas.
- Panel data can provide better insights for longitudinal studies and support decision-making processes across fields such as economics, education, and social sciences.

With the case study, we also saw how data-driven approaches can lead to valuable insights in educational settings, allowing us to measure the effect of new teaching methods over time.

I encourage you to experiment further with your own datasets and apply these techniques to deepen your understanding of panel data analysis.