## Open Book Exam Templates for MGT301


## Section 1: Conceptual Questions

### General Understanding

1. **Explain the difference between DataFrame and Series in pandas. Provide an example of when you would use each.**
   - A **Series** is a one-dimensional labeled array capable of holding any data type (integers, strings, floats, etc.). It's essentially a single column of data.
   - A **DataFrame** is a two-dimensional labeled data structure, which can be thought of as a collection of Series sharing the same index.

   **Example:**
   - Use a **Series** when working with a single column of data, like stock prices over time.
   - Use a **DataFrame** when working with multiple columns, like stock prices and volumes over time.
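
As a minimal sketch of the distinction, assuming made-up closing prices and volumes (the names `prices` and `market` are illustrative, not part of the course material):

```python
import pandas as pd

# Hypothetical daily closing prices (illustrative values only)
dates = pd.date_range("2024-01-01", periods=3, freq="D")
prices = pd.Series([101.5, 102.0, 100.8], index=dates, name="Close")

# A DataFrame holds several columns that share one index
market = pd.DataFrame(
    {"Close": [101.5, 102.0, 100.8], "Volume": [1200, 1500, 900]},
    index=dates,
)

print(prices.ndim)  # a Series is one-dimensional
print(market.ndim)  # a DataFrame is two-dimensional

# Selecting a single column from a DataFrame gives back a Series
close_column = market["Close"]
print(type(close_column).__name__)
```

Selecting one column of `market` returns a Series, which is why a DataFrame can be read as a set of Series on a common index.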

2. **Discuss three methods to handle outliers in a dataset. Provide an example for each.**
   - **Trimming:** Remove data points beyond a specific threshold (e.g., more than 1.5 × IQR below Q1 or above Q3).
   - **Capping:** Replace outliers with the nearest boundary values (e.g., 5th and 95th percentiles).
   - **Transformations:** Apply transformations (e.g., logarithmic) to reduce the effect of outliers.
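
The three approaches can be sketched as follows, assuming a small illustrative series with one extreme value; the thresholds mirror the examples in the list and are choices, not fixed rules:

```python
import numpy as np
import pandas as pd

values = pd.Series([10, 15, 14, 12, 200, 13, 17], name="Value")

# Trimming: keep only points within 1.5 * IQR of the quartiles
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
trimmed = values[values.between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

# Capping: pull extreme points in to the 5th and 95th percentiles
lower, upper = values.quantile(0.05), values.quantile(0.95)
capped = values.clip(lower=lower, upper=upper)

# Transformation: a log transform compresses large values
# (only valid here because every value is positive)
logged = np.log(values)

print(trimmed.tolist())
print(capped.tolist())
print(logged.round(2).tolist())
```

Trimming shortens the dataset, while capping and transforming keep every row, which matters when the rows need to stay aligned with other columns.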

---

### Data Cleaning

3. **What is the importance of handling missing values in data preprocessing? Illustrate with examples.**
   - Missing values can bias summary statistics and shrink the usable sample, which leads to inaccurate models. Strategies include:
     - **Removing rows/columns with too many missing values.**
     - **Imputing with mean, median, or mode.**
     - **Using predictive models for imputation.**
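
A minimal sketch of the first two strategies, assuming a hypothetical table in which one column is mostly empty; the 50% cutoff is an arbitrary choice for illustration, and model-based imputation is not shown:

```python
import numpy as np
import pandas as pd

# Hypothetical employee data with gaps (illustrative values only)
df = pd.DataFrame({
    "Age": [25, np.nan, 35, 40],
    "Salary": [50000, 52000, np.nan, 61000],
    "Notes": [np.nan, np.nan, np.nan, "ok"],
})

# Count missing values per column before deciding what to do
missing_counts = df.isna().sum()
print(missing_counts)

# Remove columns where more than half of the values are missing
reduced = df.loc[:, df.isna().mean() <= 0.5]

# Impute the remaining numeric gaps with each column's median
imputed = reduced.fillna(reduced.median())
print(imputed)
```

The median is used here rather than the mean because it is less sensitive to extreme values; either choice should be justified for the data at hand.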

4. **Describe the process of using the z-score method for identifying outliers. What are the advantages and disadvantages of this method?**
   - Calculate the z-score for each data point: `z = (x - mean) / standard deviation`.
   - Identify outliers as points whose absolute z-score exceeds a threshold (e.g., |z| > 3).
   - **Advantages:** Simple and interpretable.
   - **Disadvantages:** Assumes an approximately normal distribution, and the mean and standard deviation it relies on are themselves pulled by extreme values.
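
A minimal sketch of these steps, assuming seven illustrative values with one obvious extreme and the population standard deviation (`ddof=0`); the variable names are hypothetical:

```python
import pandas as pd

values = pd.Series([10, 15, 14, 12, 200, 13, 17], name="Value")

# Step 1: z-score of every point
z = (values - values.mean()) / values.std(ddof=0)

# Step 2: flag points whose absolute z-score exceeds the threshold
flagged_at_3 = values[z.abs() > 3]
flagged_at_2 = values[z.abs() > 2]

print(z.round(2).tolist())
print(flagged_at_3.tolist())
print(flagged_at_2.tolist())
```

In this small sample the value 200 has a z-score of about 2.45, so the usual threshold of 3 flags nothing, while a threshold of 2 catches it. The outlier inflates the standard deviation used to judge it, which is the disadvantage noted above.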

---

### Descriptive Statistics

5. **Define skewness and kurtosis. What do they indicate about a dataset?**
   - **Skewness:** Measures asymmetry in the distribution.
     - Positive skew: Tail is on the right.
     - Negative skew: Tail is on the left.
   - **Kurtosis:** Measures the "tailedness" of the distribution.
     - High kurtosis: Heavy tails, so extreme values (outliers) are more likely.
     - Low kurtosis: Light tails.
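
Both measures are available on a pandas Series. The sketch below assumes two small made-up samples; note that `Series.kurt()` reports excess kurtosis, so a normal distribution scores about 0:

```python
import pandas as pd

symmetric = pd.Series([1, 2, 3, 4, 5])
right_tailed = pd.Series([1, 2, 2, 3, 3, 3, 4, 50])
left_tailed = -right_tailed  # mirror image of the right-tailed sample

print(symmetric.skew())     # close to 0 for a symmetric sample
print(right_tailed.skew())  # positive: long tail on the right
print(left_tailed.skew())   # negative: long tail on the left
print(right_tailed.kurt())  # positive excess kurtosis: heavy tail
```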

6. **Explain the difference between variance and covariance. Provide a scenario where each is relevant.**
   - **Variance:** Measures the spread of a single variable.
     - Relevant for understanding variability in stock prices.
   - **Covariance:** Measures how two variables change together.
     - Relevant for assessing the relationship between two stock prices.
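
As a small illustration, assuming two hypothetical stocks whose prices move together (pandas uses the sample formulas, dividing by n - 1, by default):

```python
import pandas as pd

# Hypothetical prices for two stocks (illustrative values only)
prices = pd.DataFrame({
    "StockA": [10.0, 11.0, 12.0, 13.0],
    "StockB": [20.0, 22.0, 24.0, 26.0],
})

# Variance: spread of one variable
variance_a = prices["StockA"].var()

# Covariance: how two variables move together
covariance_ab = prices["StockA"].cov(prices["StockB"])

# The covariance matrix has variances on the diagonal
cov_matrix = prices.cov()

print(variance_a)
print(covariance_ab)
print(cov_matrix)
```

A positive covariance means the two prices tend to rise and fall together; its size depends on the units, so correlation is often used when comparing pairs.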

---

### Grouping and Aggregation

7. **Explain the purpose of the `groupby` function in pandas. Provide an example of its usage.**
   - Groups data based on a key and applies operations (e.g., sum, mean).
   ```python
   df.groupby('Category').sum()
   ```


### Section 4: Coding Templates

#### Template 1: DataFrame Creation

```python
import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'Department': ['HR', 'IT', 'Finance']
}

# Create DataFrame
df = pd.DataFrame(data)
print(df)
```

Output:

```text
      Name  Age Department
0    Alice   25         HR
1      Bob   30         IT
2  Charlie   35    Finance
```


#### Template 2: Handling Missing Values

```python
import pandas as pd

data = {
    'Name': ['Alice', 'Bob', None],
    'Age': [25, None, 35]
}

df = pd.DataFrame(data)

# Fill missing values by assigning the result back to the column.
# Calling df[col].fillna(value, inplace=True) acts on an intermediate copy:
# it raises a warning in current pandas and will not work in pandas 3.0.
df['Name'] = df['Name'].fillna('Unknown')
df['Age'] = df['Age'].fillna(df['Age'].mean())
print(df)
```

Output:

```text
      Name   Age
0    Alice  25.0
1      Bob  30.0
2  Unknown  35.0
```


#### Template 3: Outlier Removal Using IQR

```python
import pandas as pd

data = {
    'Value': [10, 15, 14, 12, 200, 13, 17]
}

df = pd.DataFrame(data)

# Calculate IQR
Q1 = df['Value'].quantile(0.25)
Q3 = df['Value'].quantile(0.75)
IQR = Q3 - Q1

# Keep only rows within 1.5 * IQR of the quartiles
filtered_df = df[(df['Value'] >= (Q1 - 1.5 * IQR)) & (df['Value'] <= (Q3 + 1.5 * IQR))]
print(filtered_df)
```

Output:

```text
   Value
0     10
1     15
2     14
3     12
5     13
6     17
```


#### Template 4: Data Aggregation

```python
import pandas as pd

data = {
    'Region': ['East', 'West', 'East', 'North'],
    'Sales': [100, 200, 150, 300]
}

df = pd.DataFrame(data)

# Group by and aggregate
grouped = df.groupby('Region')['Sales'].sum()
print(grouped)
```

Output:

```text
Region
East     250
North    300
West     200
Name: Sales, dtype: int64
```


#### Template 5: Visualization

```python
# Requires matplotlib to be installed (e.g., pip install matplotlib)
import matplotlib.pyplot as plt
import pandas as pd

data = {
    'Category': ['A', 'B', 'C'],
    'Values': [10, 20, 15]
}

df = pd.DataFrame(data)

# Bar plot
plt.bar(df['Category'], df['Values'])
plt.xlabel('Category')
plt.ylabel('Values')
plt.title('Category vs Values')
plt.show()
```