**Title**: Understanding the Dataset <br>

**Task**: For the given datasets below, identify the data types and dimensions.<br>

Task 1: Employee Dataset<br>
Columns: Employee ID, Name, Age, Department, Salary, Joining Date<br>
Task 2: Product Sales Dataset<br>
Columns: Product ID, Product Name, Price, Quantity Sold, Category, Sales Date<br>
Task 3: Student Grades Dataset<br>
Columns: Student ID, Student Name, Math Score, Science Score, English Score, Year<br>

Instructions:<br>

Identify which columns are numerical (continuous or discrete) and which are categorical (nominal or ordinal).<br>
Note down the dimensions (number of rows and columns) of each dataset.

In [20]:
# Write your code from here
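One possible approach, sketched on a small made-up Employee table because no data file accompanies this task: the rows, values, and the names `df_employee` and `column_kinds` are illustrative assumptions. `DataFrame.shape` gives the dimensions and `DataFrame.dtypes` the storage type of each column. The measurement scale (nominal, ordinal, discrete, continuous) still has to be judged from what the column means, so it is written out by hand here.

```python
import pandas as pd

# Hypothetical sample rows; the real dataset is not provided in this notebook.
df_employee = pd.DataFrame({
    "Employee ID": [101, 102, 103, 104],
    "Name": ["Asha", "Ben", "Chen", "Dana"],
    "Age": [29, 41, 35, 52],
    "Department": ["HR", "IT", "IT", "Finance"],
    "Salary": [52000.0, 68000.5, 61000.0, 75000.25],
    "Joining Date": pd.to_datetime(
        ["2019-03-01", "2015-07-15", "2018-11-20", "2010-01-05"]
    ),
})

# Dimensions: (number of rows, number of columns)
n_rows, n_cols = df_employee.shape
print(f"Dimensions: {n_rows} rows x {n_cols} columns")

# Storage types as pandas sees them
print(df_employee.dtypes)

# Columns stored as numbers; note that this includes the ID column
numeric_cols = df_employee.select_dtypes(include="number").columns.tolist()
print("Stored as numbers:", numeric_cols)

# Measurement scale, assigned by hand from the meaning of each column
column_kinds = {
    "Employee ID": "categorical (nominal identifier, despite the integer dtype)",
    "Name": "categorical (nominal)",
    "Age": "numerical (discrete)",
    "Department": "categorical (nominal)",
    "Salary": "numerical (continuous)",
    "Joining Date": "date/time",
}
for column, kind in column_kinds.items():
    print(f"{column}: {kind}")
```

The same pattern should carry over to the Product Sales and Student Grades datasets by swapping in their columns.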


**Title**: Checking for Missing Values<br>

**Task**: Identify and count the number of missing values in each dataset.<br>

Instructions:<br>
Use Python or any data manipulation tool to check for missing values in each column of the datasets. <br>Report the columns which have missing values and their counts.

In [None]:
# Write your code from here
import pandas as pd
import numpy as np

# Let's create a sample DataFrame with some missing values
data = {'col1': [1, 2, np.nan, 4, 5],
        'col2': ['a', np.nan, 'c', 'd', 'e'],
        'col3': [10.5, 20.3, 15.0, np.nan, 12.8],
        'col4': [True, False, True, True, np.nan]}
df = pd.DataFrame(data)

print("Original DataFrame:")
print(df)
print("\n")

# Check for missing values in each column
missing_values = df.isnull().sum()

# Report columns with missing values and their counts
print("Missing Values per Column:")
print(missing_values[missing_values > 0])

# Alternatively, to see all columns and their missing value counts:
print("\nMissing Values in All Columns:")
print(missing_values)
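To run the same check on each of the three datasets, the count can be wrapped in a small helper. This is a minimal sketch: `missing_report` is a hypothetical name, and the two tiny frames below are made-up stand-ins for the real data.

```python
import numpy as np
import pandas as pd

def missing_report(df):
    """Return count and percentage of missing values for columns that have any."""
    counts = df.isna().sum()
    report = pd.DataFrame({
        "missing_count": counts,
        "missing_pct": (counts / len(df) * 100).round(1),
    })
    return report[report["missing_count"] > 0]

# Made-up miniature datasets, for illustration only
datasets = {
    "Employee": pd.DataFrame({
        "Age": [25, np.nan, 40],
        "Department": ["HR", "IT", None],
    }),
    "Product Sales": pd.DataFrame({
        "Price": [10.0, 15.5, np.nan],
        "Quantity Sold": [3, 7, 2],
    }),
}

for name, frame in datasets.items():
    print(f"\n{name} dataset:")
    print(missing_report(frame))
```

Reporting a percentage next to the raw count makes it easier to compare columns across datasets of different sizes.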

**Title**: Handling Outliers<br>

**Task**: Detect and propose handling methods for outliers in the numerical columns of the datasets.<br>

Task 1: Age in Employee Dataset<br>
Task 2: Price in Product Sales Dataset<br>
Task 3: Math Score in Student Grades Dataset<br>

Instructions:<br>

Use box plots to visualize potential outliers.<br>
Suggest methods to handle them, such as removal or transformation.

In [22]:
# Write your code from here
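The handling methods named in the instructions (removal, transformation) and a common third option (capping) can be sketched as follows. The sample values are the illustrative ones used in the box-plot cell later in this notebook; `iqr_bounds` is a hypothetical helper, and the 0-100 score range is an assumption about how the scores are graded.

```python
import numpy as np
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return (lower, upper) fences using the k * IQR rule."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Age: flag values outside the fences, then either cap or remove them
ages = pd.Series([25, 30, 22, 35, 28, 40, 27, 32, 65, 29, 31, 70], name="Age")
low, high = iqr_bounds(ages)
ages_capped = ages.clip(lower=low, upper=high)   # capping: pull outliers to the fence
ages_removed = ages[ages.between(low, high)]     # removal: keep only in-range rows
print(f"Age fences: [{low}, {high}]")
print("Ages outside the fences:", ages[~ages.between(low, high)].tolist())

# Price: a log transform compresses the long right tail (log1p also handles 0)
prices = pd.Series([10, 15, 12, 20, 18, 25, 16, 30, 5, 100, 22, 14, 110], name="Price")
log_prices = np.log1p(prices)
print("Max/median before:", prices.max() / prices.median())
print("Max/median after log1p:", log_prices.max() / log_prices.median())

# Math Score: a domain check catches impossible values regardless of the IQR rule
math_scores = pd.Series(
    [75, 80, 85, 90, 78, 92, 88, 60, 82, 95, 105, 70, 65, 20], name="Math Score"
)
invalid_scores = math_scores[~math_scores.between(0, 100)]
print("Scores outside 0-100:", invalid_scores.tolist())
```

Which option fits depends on why the value is extreme: capping and removal change the data, so they are usually reserved for values that investigation shows to be errors.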


**Title**: Visualizing Data Distributions<br>

**Task**: Create visualizations for data distributions.<br>

Task 1: Histogram for Age in Employee Dataset<br>
Task 2: Distribution plot for Price in Product Sales Dataset<br>
Task 3: Histogram for Math Score in Student Grades Dataset

Instructions:<br>

Use matplotlib or seaborn in Python to create the plots.<br>
Comment on the skewness or normality of the distributions.

In [None]:
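# Box plots plus the 1.5 * IQR rule to flag potential outliers in each column
# (this is the Handling Outliers analysis for Age, Price, and Math Score)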
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
print("--- Task 1: Age in Employee Dataset ---")
employee_ages = pd.Series([25, 30, 22, 35, 28, 40, 27, 32, 65, 29, 31, 70])
df_employee = pd.DataFrame({'Age': employee_ages})
plt.figure(figsize=(8, 6))
sns.boxplot(x=df_employee['Age'])
plt.title('Box Plot of Employee Ages')
plt.xlabel('Age')
plt.show()
Q1 = df_employee['Age'].quantile(0.25)
Q3 = df_employee['Age'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers_age = df_employee[(df_employee['Age'] < lower_bound) | (df_employee['Age'] > upper_bound)]
print("\nPotential Outliers in Employee Age:")
print(outliers_age)

print("\nSuggested Handling Methods for Age Outliers:")
print("- Removal: If the outliers are data entry errors or exceptional cases not relevant to the analysis.")
print("- Capping/Winsorizing: Replace outlier values with a predefined lower or upper limit (e.g., the 1st and 99th percentiles).")
print("- Transformation: A logarithmic transformation can reduce the impact of extreme values.")
print("- Keep as is: If the extreme ages are valid and important for the analysis.")
print("\n--- Task 2: Price in Product Sales Dataset ---")
product_prices = pd.Series([10, 15, 12, 20, 18, 25, 16, 30, 5, 100, 22, 14, 110])
df_sales = pd.DataFrame({'Price': product_prices})
plt.figure(figsize=(8, 6))
sns.boxplot(x=df_sales['Price'])
plt.title('Box Plot of Product Prices')
plt.xlabel('Price')
plt.show()
Q1_price = df_sales['Price'].quantile(0.25)
Q3_price = df_sales['Price'].quantile(0.75)
IQR_price = Q3_price - Q1_price
lower_bound_price = Q1_price - 1.5 * IQR_price
upper_bound_price = Q3_price + 1.5 * IQR_price
outliers_price = df_sales[(df_sales['Price'] < lower_bound_price) | (df_sales['Price'] > upper_bound_price)]
print("\nPotential Outliers in Product Price:")
print(outliers_price)

print("\nSuggested Handling Methods for Price Outliers:")
print("- Investigation: Determine whether high prices come from premium products or from data entry errors.")
print("- Segmentation: Analyze high-priced items separately if they form a distinct product segment.")
print("- Capping: If very high prices are likely errors, cap them at a reasonable maximum.")
print("- Transformation: A logarithmic transformation can help with right-skewed price distributions.")
print("- Removal: Consider removal only if the outliers are confirmed errors and not representative.")
print("\n--- Task 3: Math Score in Student Grades Dataset ---")
math_scores = pd.Series([75, 80, 85, 90, 78, 92, 88, 60, 82, 95, 105, 70, 65, 20])
df_grades = pd.DataFrame({'Math Score': math_scores})
plt.figure(figsize=(8, 6))
sns.boxplot(x=df_grades['Math Score'])
plt.title('Box Plot of Math Scores')
plt.xlabel('Math Score')
plt.show()
Q1_score = df_grades['Math Score'].quantile(0.25)
Q3_score = df_grades['Math Score'].quantile(0.75)
IQR_score = Q3_score - Q1_score
lower_bound_score = Q1_score - 1.5 * IQR_score
upper_bound_score = Q3_score + 1.5 * IQR_score
outliers_score = df_grades[(df_grades['Math Score'] < lower_bound_score) | (df_grades['Math Score'] > upper_bound_score)]
print("\nPotential Outliers in Math Score:")
print(outliers_score)

print("\nSuggested Handling Methods for Math Score Outliers:")
print("- Investigation: Understand the reasons for very low or unexpectedly high scores (e.g., errors, special circumstances).")
print("- Capping/Flooring: Limit scores to the valid range (e.g., 0-100); 105 exceeds that range even though the IQR rule does not flag it.")
print("- Contextual Consideration: A very low score might be valid and important to keep.")
print("- Transformation: Rarely useful for bounded scores, but might be considered in specific scenarios.")
print("- Imputation (with caution): If a score is clearly an error, consider imputing a plausible value.")
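The task in this section asks for histograms and a comment on skewness, which the box plots above do not show directly. A minimal sketch follows, reusing the same illustrative sample values. It uses plain matplotlib and pandas `Series.skew()`; the bin count of 8 is an arbitrary choice, and seaborn's `histplot` would be an equally reasonable alternative.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Same illustrative samples as in the box-plot code above
samples = {
    "Age (Employee)": pd.Series([25, 30, 22, 35, 28, 40, 27, 32, 65, 29, 31, 70]),
    "Price (Product Sales)": pd.Series([10, 15, 12, 20, 18, 25, 16, 30, 5, 100, 22, 14, 110]),
    "Math Score (Student Grades)": pd.Series([75, 80, 85, 90, 78, 92, 88, 60, 82, 95, 105, 70, 65, 20]),
}

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
skewness = {}
for ax, (label, values) in zip(axes, samples.items()):
    ax.hist(values, bins=8, edgecolor="black")
    skewness[label] = values.skew()  # sample skewness; 0 for a symmetric distribution
    ax.set_title(f"{label}\nskewness = {skewness[label]:.2f}")
    ax.set_xlabel(label)
    ax.set_ylabel("Count")
plt.tight_layout()
plt.show()

# Short comment on the direction of the skew for each column
for label, value in skewness.items():
    direction = "right (positive skew)" if value > 0 else "left (negative skew)"
    print(f"{label}: skewness {value:.2f}, longer tail to the {direction}")
```

On these samples the few large ages and prices pull the tail to the right, while the single score of 20 pulls the Math Score tail to the left, so none of the three looks close to normal.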

**Title**: Finding Relationships Between Features<br>

**Task**: Identify relationships between pairs of features in the datasets.<br>

Task 1: Salary vs Age in Employee Dataset<br>
Task 2: Price vs Quantity Sold in Product Sales Dataset<br>
Task 3: Math Score vs Science Score in Student Grades Dataset

Instructions: <br>

Use scatter plots or correlation coefficients to analyze the relationships.<br>
Describe any insights or patterns observed.

In [None]:
# Write your code from here
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

print("--- Task 1: Salary vs Age in Employee Dataset ---")

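# Simulated data (no real dataset is provided): salary rises by 1000 per year
# of age on average, with normally distributed noise (standard deviation 10000)
np.random.seed(0)
n_employees = 100
ages = np.random.randint(22, 60, n_employees)
salaries = 30000 + 1000 * ages + np.random.normal(0, 10000, n_employees)
df_employee = pd.DataFrame({'Age': ages, 'Salary': salaries})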

plt.figure(figsize=(8, 6))
sns.scatterplot(x='Age', y='Salary', data=df_employee)
plt.title('Salary vs Age in Employee Dataset')
plt.xlabel('Age (Years)')
plt.ylabel('Salary (USD)')
plt.grid(True)
plt.show()

correlation_age_salary = df_employee['Age'].corr(df_employee['Salary'])
print(f"\nCorrelation Coefficient between Age and Salary: {correlation_age_salary:.2f}")

print("\nInsights and Patterns Observed (Salary vs Age):")
print("- Generally positive trend: as age increases, salary tends to increase.")
print(f"- A correlation coefficient of {correlation_age_salary:.2f} indicates a positive linear relationship; its magnitude shows how strong it is.")
print("- Salaries still vary considerably among employees of the same age.")

print("\n--- Task 2: Price vs Quantity Sold in Product Sales Dataset ---")
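# Simulated data: quantity sold falls by 1.5 units per unit of price on average,
# plus noise; negative quantities are clipped to 0
np.random.seed(1)
n_products = 150
prices = np.random.uniform(5, 50, n_products)
quantity_sold = 100 - 1.5 * prices + np.random.normal(0, 20, n_products)
quantity_sold = np.clip(quantity_sold, 0, None)
df_sales = pd.DataFrame({'Price': prices, 'Quantity Sold': quantity_sold})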

plt.figure(figsize=(8, 6))
sns.scatterplot(x='Price', y='Quantity Sold', data=df_sales)
plt.title('Price vs Quantity Sold in Product Sales Dataset')
plt.xlabel('Price (USD)')
plt.ylabel('Quantity Sold')
plt.grid(True)
plt.show()

correlation_price_quantity = df_sales['Price'].corr(df_sales['Quantity Sold'])
print(f"\nCorrelation Coefficient between Price and Quantity Sold: {correlation_price_quantity:.2f}")

print("\nInsights and Patterns Observed (Price vs Quantity Sold):")
print("- Generally negative trend: as price increases, quantity sold tends to decrease.")
print(f"- A correlation coefficient of {correlation_price_quantity:.2f} indicates a negative linear relationship; its magnitude shows how strong it is.")
print("- Some products deviate from this trend, which points to factors other than price.")

print("\n--- Task 3: Math Score vs Science Score in Student Grades Dataset ---")

# Simulated data: science scores are generated from math scores plus noise,
# then limited to the 0-100 range
np.random.seed(2)
n_students = 120
math_scores = np.random.randint(50, 100, n_students)
science_scores = 0.8 * math_scores + 10 + np.random.normal(0, 8, n_students)
science_scores = np.clip(science_scores, 0, 100)
df_grades = pd.DataFrame({'Math Score': math_scores, 'Science Score': science_scores})

plt.figure(figsize=(8, 6))
sns.scatterplot(x='Math Score', y='Science Score', data=df_grades)
plt.title('Math Score vs Science Score in Student Grades Dataset')
plt.xlabel('Math Score')
plt.ylabel('Science Score')
plt.grid(True)
plt.show()

correlation_math_science = df_grades['Math Score'].corr(df_grades['Science Score'])
print(f"\nCorrelation Coefficient between Math Score and Science Score: {correlation_math_science:.2f}")

print("\nInsights and Patterns Observed (Math Score vs Science Score):")
print("- Positive trend: students with higher math scores tend to have higher science scores.")
print(f"- A correlation coefficient of {correlation_math_science:.2f} indicates a positive linear relationship; its magnitude shows how strong it is.")
print("- In real data this would suggest shared factors behind performance in both subjects; correlation alone does not show causation.")
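Rather than typing "moderate" or "strong" by hand, the verbal label can be derived from the computed coefficient. The sketch below assumes rule-of-thumb cut-offs of 0.3 and 0.7, which are a convention rather than a standard, and `describe_correlation` is a hypothetical helper. It is applied to the same simulated Age and Salary data as Task 1.

```python
import numpy as np
import pandas as pd

def describe_correlation(r, weak=0.3, strong=0.7):
    """Turn a Pearson r into a verbal label; the thresholds are an assumption."""
    if np.isnan(r):
        return "undefined"
    size = abs(r)
    if size < weak:
        strength = "weak"
    elif size < strong:
        strength = "moderate"
    else:
        strength = "strong"
    direction = "positive" if r > 0 else "negative"
    return f"{strength} {direction}"

# Same simulation as Task 1 above
np.random.seed(0)
n_employees = 100
ages = np.random.randint(22, 60, n_employees)
salaries = 30000 + 1000 * ages + np.random.normal(0, 10000, n_employees)
df_employee = pd.DataFrame({"Age": ages, "Salary": salaries})

r = df_employee["Age"].corr(df_employee["Salary"])
print(f"Age vs Salary: r = {r:.2f} ({describe_correlation(r)} linear relationship)")
```

Deriving the label from `r` keeps the written insight consistent with the number if the seed, sample size, or noise level changes.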