Title: Histogram of a Single Feature <br>

Question 1: Create a histogram for the age feature from a dataset. Interpret what the shape of the histogram tells us about the distribution of the age feature.

In [None]:
# Write your code from here
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Assuming your dataset is in a pandas DataFrame called 'df'
# and the age feature is in a column named 'age'

# Let's create a sample DataFrame for demonstration purposes
data = {'age': [25, 30, 22, 35, 40, 28, 26, 32, 45, 29, 31, 38, 27, 33, 42, 24, 36, 39, 34, 41,
                50, 55, 48, 60, 52, 58, 49, 53, 62, 51, 23, 37, 43, 31, 29, 57, 46, 33, 59, 44]}
df = pd.DataFrame(data)

# Create the histogram using matplotlib
plt.figure(figsize=(10, 6))
plt.hist(df['age'], bins=10, edgecolor='black', alpha=0.7)
plt.title('Histogram of Age')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.grid(axis='y', alpha=0.5)
plt.show()

# Alternatively, you can use seaborn for a more visually appealing histogram
plt.figure(figsize=(10, 6))
sns.histplot(df['age'], bins=10, kde=True) # kde=True adds a kernel density estimate line
plt.title('Histogram of Age with KDE')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()

# Interpretation of the shape (based on the sample data):
print("\nInterpretation of the histogram shape (based on the sample data):")
print("- With 10 bins, the counts are fairly even (2 to 6 per bin), with a mild peak in the low 30s rather than clearly separate modes.")
print("- The ages are spread out, ranging from 22 to 62.")
print("- There is no strong skew, although the upper bins are slightly sparser, which hints at a mild right skew.")
print("- With only 40 observations, small bumps are likely sampling noise; more data would be needed before concluding that subgroups exist.")
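
The visual reading of the histogram can be cross-checked numerically. The sketch below assumes the same sample ages and the same 10 equal-width bins, and uses `np.histogram` to print the per-bin counts that the plot draws. It is an illustration added for checking the interpretation, not part of the original exercise.

```python
import numpy as np

# Same sample ages as in the cell above
ages = np.array([25, 30, 22, 35, 40, 28, 26, 32, 45, 29, 31, 38, 27, 33, 42, 24, 36, 39, 34, 41,
                 50, 55, 48, 60, 52, 58, 49, 53, 62, 51, 23, 37, 43, 31, 29, 57, 46, 33, 59, 44])

# np.histogram uses the same equal-width binning as plt.hist(bins=10)
counts, edges = np.histogram(ages, bins=10)

for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:4.0f} to {right:4.0f}: {count}")

# Comparing mean and median gives a rough hint about skew direction
print(f"mean = {ages.mean():.1f}, median = {np.median(ages):.1f}")
```

A mean slightly above the median is consistent with a mild right skew, but with a sample this small the difference should be read cautiously.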

Title: Boxplot for a Single Feature <br>

Question 2: Generate a boxplot for the salary feature and identify any outliers.

In [None]:
# Write your code from here
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Assuming your dataset is in a pandas DataFrame called 'df'
# and the salary feature is in a column named 'salary'

# Let's create a sample DataFrame with some potential outliers
data = {'salary': [50000, 55000, 60000, 62000, 65000, 58000, 70000, 63000, 52000, 68000,
                   90000, 100000, 120000, 40000, 35000, 75000, 80000, 150000, 300000, 72000]}
df = pd.DataFrame(data)

# Generate the boxplot using seaborn
plt.figure(figsize=(8, 6))
sns.boxplot(y=df['salary'])
plt.title('Boxplot of Salary')
plt.ylabel('Salary (in some currency)')
plt.grid(axis='y', alpha=0.5)
plt.show()

# Identifying outliers (visually from the boxplot)
print("\nIdentifying Outliers:")
print("- Outliers are drawn as individual points beyond the 'whiskers', which by default extend to the most extreme values within 1.5 * IQR of the box.")
print("- In the generated boxplot (based on the sample data), two points lie above the upper whisker: 150000 and 300000.")
print("- No values fall below the lower whisker, so there are no low-end outliers in this sample.")

# You can also programmatically identify outliers using the IQR method:
Q1 = df['salary'].quantile(0.25)
Q3 = df['salary'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

outliers = df[(df['salary'] < lower_bound) | (df['salary'] > upper_bound)]
print("\nProgrammatically identified potential outliers (using IQR method):")
print(outliers)
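
To connect the IQR rule to what the boxplot draws, the following sketch computes the 1.5 * IQR fences and the positions where the whiskers end (the most extreme observations still inside the fences). It assumes the same sample salaries; the variable names are illustrative.

```python
import pandas as pd

# Same sample salaries as in the cell above
salary = pd.Series([50000, 55000, 60000, 62000, 65000, 58000, 70000, 63000, 52000, 68000,
                    90000, 100000, 120000, 40000, 35000, 75000, 80000, 150000, 300000, 72000])

q1, q3 = salary.quantile(0.25), salary.quantile(0.75)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers stop at the most extreme data points that are still inside the fences
inside = salary[(salary >= lower_fence) & (salary <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Everything outside the fences is drawn as an individual outlier point
outlier_values = sorted(salary[(salary < lower_fence) | (salary > upper_fence)].tolist())

print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
print(f"Fences: [{lower_fence}, {upper_fence}]")
print(f"Whiskers end at: {whisker_low} and {whisker_high}")
print(f"Outliers: {outlier_values}")
```

Note that the whiskers end at actual data values, not at the fences themselves, which is why the upper whisker can sit just below the fence.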

Title: Violin Plot of a Single Feature <br>

Question 3: Use a violin plot to visualize the distribution of the height feature and comment on its shape.

In [None]:
# Write your code from here
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Assuming your dataset is in a pandas DataFrame called 'df'
# and the height feature is in a column named 'height'

# Let's create a sample DataFrame with some height data (in cm)
data = {'height': [165, 170, 158, 175, 180, 162, 168, 172, 155, 178,
                   163, 171, 166, 176, 182, 159, 169, 173, 157, 179,
                   167, 174, 161, 177, 181, 160, 170, 174, 164, 177]}
df = pd.DataFrame(data)

# Generate the violin plot using seaborn
plt.figure(figsize=(6, 8))
sns.violinplot(y=df['height'])
plt.title('Violin Plot of Height')
plt.ylabel('Height (cm)')
plt.grid(axis='y', alpha=0.5)
plt.show()

# Commenting on the shape:
print("\nComments on the shape of the violin plot (based on the sample data):")
print("- The width of the violin plot at different heights indicates the density of the data at those values.")
print("- The wider sections suggest higher frequencies of individuals with heights in those ranges.")
print("- The plot appears to be roughly symmetrical around the central bulge, suggesting a relatively normal-like distribution for this sample.")
print("- The presence of a single main bulge (mode) indicates that the heights are concentrated around a central value.")
print("- The thin tails at the top and bottom come from the kernel density estimate, which by default extends slightly beyond the observed minimum and maximum.")
print("- The miniature boxplot inside the violin marks the median, with the thick bar showing the interquartile range (IQR) and the thin line showing the whiskers.")
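
The miniature boxplot inside the violin summarizes the same quartiles a regular boxplot would. As a rough check, this sketch computes them directly for the same sample heights, so the values can be compared against the plot.

```python
import pandas as pd

# Same sample heights (cm) as in the cell above
height = pd.Series([165, 170, 158, 175, 180, 162, 168, 172, 155, 178,
                    163, 171, 166, 176, 182, 159, 169, 173, 157, 179,
                    167, 174, 161, 177, 181, 160, 170, 174, 164, 177])

median = height.median()
q1 = height.quantile(0.25)
q3 = height.quantile(0.75)

print(f"Median: {median} cm")
print(f"IQR: {q1} cm to {q3} cm (width {q3 - q1} cm)")
print(f"Range: {height.min()} cm to {height.max()} cm")
```

If the median sits near the middle of the IQR and the IQR sits near the middle of the range, that supports the visual impression of a roughly symmetrical distribution.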

Title: Scatter Plot to Analyze Relationship<br>

Question 4: Create a scatter plot for the weight and height features to determine if there is a trend.

In [None]:
# Write your code from here
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Assuming your dataset is in a pandas DataFrame called 'df'
# and the weight feature is in a column named 'weight'
# and the height feature is in a column named 'height'

# Let's create a sample DataFrame with weight (in kg) and height (in cm) data
data = {'weight': [65, 70, 58, 78, 82, 60, 68, 75, 55, 80,
                    62, 72, 66, 76, 85, 59, 69, 73, 57, 81,
                    67, 74, 61, 77, 83, 56, 71, 74, 63, 79],
        'height': [165, 170, 158, 175, 180, 162, 168, 172, 155, 178,
                    163, 171, 166, 176, 182, 159, 169, 173, 157, 179,
                    167, 174, 161, 177, 181, 160, 170, 174, 164, 177]}
df = pd.DataFrame(data)

# Generate the scatter plot using matplotlib
plt.figure(figsize=(8, 6))
plt.scatter(df['height'], df['weight'])
plt.title('Scatter Plot of Weight vs. Height')
plt.xlabel('Height (cm)')
plt.ylabel('Weight (kg)')
plt.grid(True)
plt.show()

# Alternatively, you can use seaborn for a more visually appealing scatter plot
plt.figure(figsize=(8, 6))
sns.scatterplot(x='height', y='weight', data=df)
plt.title('Scatter Plot of Weight vs. Height')
plt.xlabel('Height (cm)')
plt.ylabel('Weight (kg)')
plt.grid(True)
plt.show()

# Determining the trend (based on the sample data):
print("\nDetermining the trend (based on the sample data):")
print("- By observing the scatter plot, we can look for a general pattern in how the points are distributed.")
print("- If the points tend to go upwards from left to right, it suggests a positive correlation (as height increases, weight tends to increase).")
print("- If the points tend to go downwards from left to right, it suggests a negative correlation (as height increases, weight tends to decrease).")
print("- If the points are scattered randomly with no clear direction, it suggests a weak or no linear correlation.")
print("- In the generated scatter plot (based on the sample data), you should likely see a general upward trend, indicating a positive correlation between height and weight.")
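
The trend seen in the scatter plot can be quantified. The sketch below assumes the same sample data and uses `np.corrcoef` for the Pearson correlation and `np.polyfit` for a least-squares line; both are offered as one possible check, not as the required method for this exercise.

```python
import numpy as np

# Same sample data as in the cell above
weight = np.array([65, 70, 58, 78, 82, 60, 68, 75, 55, 80,
                   62, 72, 66, 76, 85, 59, 69, 73, 57, 81,
                   67, 74, 61, 77, 83, 56, 71, 74, 63, 79])
height = np.array([165, 170, 158, 175, 180, 162, 168, 172, 155, 178,
                   163, 171, 166, 176, 182, 159, 169, 173, 157, 179,
                   167, 174, 161, 177, 181, 160, 170, 174, 164, 177])

# Pearson correlation coefficient between height and weight
r = np.corrcoef(height, weight)[0, 1]

# Least-squares line: weight = slope * height + intercept
slope, intercept = np.polyfit(height, weight, deg=1)

print(f"Pearson r: {r:.3f}")
print(f"Fitted line: weight = {slope:.2f} * height + {intercept:.2f}")
```

A positive slope together with an r close to +1 supports the upward trend described above.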

Title: Correlation Heatmap<br>

Question 5: Generate a correlation heatmap for a dataset with multiple features (e.g., height, weight, age) and explain the correlations observed.

In [None]:
# Write your code from here
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Assuming your dataset is in a pandas DataFrame called 'df'
# with numeric columns named 'height', 'weight', and 'age'

# Let's create a sample DataFrame with height (cm), weight (kg), and age (years) data
data = {'height': [165, 170, 158, 175, 180, 162, 168, 172, 155, 178,
                   163, 171, 166, 176, 182, 159, 169, 173, 157, 179,
                   167, 174, 161, 177, 181, 160, 170, 174, 164, 177],
        'weight': [65, 70, 58, 78, 82, 60, 68, 75, 55, 80,
                    62, 72, 66, 76, 85, 59, 69, 73, 57, 81,
                    67, 74, 61, 77, 83, 56, 71, 74, 63, 79],
        'age': [25, 30, 22, 35, 40, 28, 26, 32, 29, 38,
                31, 33, 27, 36, 39, 24, 37, 34, 23, 41,
                29, 35, 26, 38, 42, 25, 30, 33, 28, 37]}
df = pd.DataFrame(data)

# Compute the pairwise Pearson correlation matrix
corr_matrix = df[['height', 'weight', 'age']].corr()
print(corr_matrix.round(2))

# Generate the correlation heatmap using seaborn
plt.figure(figsize=(8, 6))
sns.heatmap(corr_matrix, annot=True, fmt='.2f', cmap='coolwarm', vmin=-1, vmax=1, square=True)
plt.title('Correlation Heatmap of Height, Weight, and Age')
plt.show()

# Explaining the correlations (based on the sample data):
print("\nExplanation of the correlations (based on the sample data):")
print("- Each cell shows the Pearson correlation coefficient between two features; the diagonal is always 1 because each feature is perfectly correlated with itself.")
print("- Height and weight are very strongly positively correlated in this sample: taller individuals tend to weigh more.")
print("- Age is also positively correlated with height and weight here, but less tightly than height is with weight.")
print("- Correlation measures linear association only and does not by itself imply causation.")
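
For the correlation heatmap asked for in Question 5, the underlying matrix can also be read numerically. The sketch below assumes a sample with `height`, `weight`, and `age` columns (the same values used in the pair plot exercise) and ranks the feature pairs by absolute correlation. Keeping only the upper triangle avoids listing each pair twice.

```python
import numpy as np
import pandas as pd

# Assumed sample data with three numeric features
df = pd.DataFrame({
    'height': [165, 170, 158, 175, 180, 162, 168, 172, 155, 178,
               163, 171, 166, 176, 182, 159, 169, 173, 157, 179,
               167, 174, 161, 177, 181, 160, 170, 174, 164, 177],
    'weight': [65, 70, 58, 78, 82, 60, 68, 75, 55, 80,
               62, 72, 66, 76, 85, 59, 69, 73, 57, 81,
               67, 74, 61, 77, 83, 56, 71, 74, 63, 79],
    'age': [25, 30, 22, 35, 40, 28, 26, 32, 29, 38,
            31, 33, 27, 36, 39, 24, 37, 34, 23, 41,
            29, 35, 26, 38, 42, 25, 30, 33, 28, 37],
})

corr = df.corr()  # pairwise Pearson correlations

# Keep only the cells strictly above the diagonal (each pair once)
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack().dropna()

# Rank pairs from strongest to weakest absolute correlation
pairs = pairs.sort_values(key=lambda s: s.abs(), ascending=False)
top_pair = pairs.index[0]

print(pairs.round(2))
print(f"Strongest pair: {top_pair}")
```

The ranked list is a compact companion to the heatmap when there are too many features to compare colors by eye.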

Title: Descriptive Statistical Analysis<br>

Question 6: Calculate the mean, median, standard deviation, skewness, and kurtosis for the
temperature feature and discuss the results.

In [None]:
# Write your code from here
import pandas as pd
from scipy.stats import skew, kurtosis

# Assuming your dataset is in a pandas DataFrame called 'df'
# and the temperature feature is in a column named 'temperature'

# Let's create a sample DataFrame with some temperature data (in Celsius)
data = {'temperature': [22.5, 23.1, 21.8, 24.5, 25.0, 22.0, 23.8, 24.2, 21.5, 24.8,
                        22.2, 23.5, 22.8, 24.0, 25.5, 21.9, 23.3, 24.3, 21.7, 24.6,
                        23.0, 23.9, 22.1, 24.4, 25.2, 21.6, 23.6, 24.1, 22.3, 24.7]}
df = pd.DataFrame(data)

# Calculate the descriptive statistics
mean_temp = df['temperature'].mean()
median_temp = df['temperature'].median()
std_dev_temp = df['temperature'].std()
skewness_temp = skew(df['temperature'])
kurtosis_temp = kurtosis(df['temperature'])

# Print the results
print(f"Mean Temperature: {mean_temp:.2f}")
print(f"Median Temperature: {median_temp:.2f}")
print(f"Standard Deviation of Temperature: {std_dev_temp:.2f}")
print(f"Skewness of Temperature: {skewness_temp:.2f}")
print(f"Kurtosis of Temperature: {kurtosis_temp:.2f}")

# Discuss the results (based on the sample data):
print("\nDiscussion of the results (based on the sample data):")
print(f"- **Mean:** The average temperature in our sample is {mean_temp:.2f} degrees Celsius. This represents the central tendency of the data.")
print(f"- **Median:** The middle value of the temperature data is {median_temp:.2f} degrees Celsius. Since the mean and median are quite close, it suggests that the distribution is likely fairly symmetrical.")
print(f"- **Standard Deviation:** The standard deviation of {std_dev_temp:.2f} indicates the typical amount of variation or dispersion of the temperatures around the mean. A smaller standard deviation would imply that the temperatures are clustered closer to the mean, while a larger value would suggest a wider spread.")
print(f"- **Skewness:** The skewness value is {skewness_temp:.2f}. Skewness measures the asymmetry of the distribution.")
print("  - A skewness close to 0 (like in our sample) suggests a roughly symmetrical distribution.")
print("  - A positive skewness would indicate a longer tail on the right side (more higher extreme values).")
print("  - A negative skewness would indicate a longer tail on the left side (more lower extreme values).")
print(f"- **Kurtosis:** The kurtosis value is {kurtosis_temp:.2f}. Kurtosis measures the 'tailedness' of the distribution. scipy's kurtosis() reports excess kurtosis by default, for which a normal distribution scores 0.")
print("  - A kurtosis close to 0 suggests a similar tailedness to a normal distribution (mesokurtic).")
print("  - A positive kurtosis (leptokurtic) indicates heavier tails and a sharper peak.")
print("  - A negative kurtosis (platykurtic) indicates thinner tails and a flatter peak.")
print("  - In our sample, the kurtosis is clearly negative, so the distribution is flatter and lighter-tailed than a normal distribution: the temperatures are spread fairly evenly across their range.")
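
To make the skewness and kurtosis values less of a black box, the sketch below recomputes them from central moments with NumPy. It follows the default definitions of `scipy.stats.skew` and `scipy.stats.kurtosis` (biased estimators, with 3 subtracted for excess kurtosis). Note that pandas' `.skew()` and `.kurt()` apply bias corrections, so their results differ slightly.

```python
import numpy as np

# Same sample temperatures (Celsius) as in the cell above
temps = np.array([22.5, 23.1, 21.8, 24.5, 25.0, 22.0, 23.8, 24.2, 21.5, 24.8,
                  22.2, 23.5, 22.8, 24.0, 25.5, 21.9, 23.3, 24.3, 21.7, 24.6,
                  23.0, 23.9, 22.1, 24.4, 25.2, 21.6, 23.6, 24.1, 22.3, 24.7])

dev = temps - temps.mean()

# Central moments (population form, dividing by n)
m2 = np.mean(dev ** 2)
m3 = np.mean(dev ** 3)
m4 = np.mean(dev ** 4)

skewness = m3 / m2 ** 1.5            # third standardized moment
excess_kurtosis = m4 / m2 ** 2 - 3.0  # fourth standardized moment minus 3

print(f"Skewness: {skewness:.2f}")
print(f"Excess kurtosis: {excess_kurtosis:.2f}")
```

A skewness near 0 with a negative excess kurtosis is what one would expect from values spread evenly over a bounded range.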

Title: Covariance and Correlation between Two Features<br>

Question 7: Compute the covariance and correlation between price and demand in a dataset.
Explain what these metrics indicate.

In [None]:
# Write your code from here
import pandas as pd

# Assuming your dataset is in a pandas DataFrame called 'df'
# and the price feature is in a column named 'price'
# and the demand feature is in a column named 'demand'

# Let's create a sample DataFrame with price and demand data
data = {'price': [10, 12, 15, 13, 11, 16, 14, 17, 9, 18],
        'demand': [100, 90, 80, 85, 95, 75, 82, 70, 105, 65]}
df = pd.DataFrame(data)

# Compute the covariance
covariance = df['price'].cov(df['demand'])
print(f"Covariance between Price and Demand: {covariance:.2f}")

# Compute the correlation (Pearson correlation coefficient)
correlation = df['price'].corr(df['demand'])
print(f"Correlation between Price and Demand: {correlation:.2f}")

# Explain what these metrics indicate:
print("\nExplanation of Covariance and Correlation:")
print("- **Covariance:**")
print(f"  - The covariance of {covariance:.2f} indicates the direction of the linear relationship between price and demand.")
print("  - A negative covariance suggests that as price increases, demand tends to decrease (an inverse relationship), which is often expected in economics.")
print("  - A positive covariance would suggest that as price increases, demand tends to increase (a direct relationship), which might occur in specific scenarios but is less common for typical goods.")
print("  - The magnitude of the covariance is not easily interpretable on its own because it depends on the scales of the variables.")
print("- **Correlation:**")
print(f"  - The correlation coefficient of {correlation:.2f} provides a standardized measure of the strength and direction of the linear relationship between price and demand.")
print("  - The correlation coefficient ranges from -1 to +1.")
print("  - A correlation close to +1 indicates a strong positive linear relationship.")
print("  - A correlation close to -1 indicates a strong negative linear relationship.")
print("  - A correlation close to 0 indicates a weak or no linear relationship.")
print(f"  - In our sample, the correlation of {correlation:.2f} is very close to -1, indicating a very strong negative linear relationship: as the price goes up, the demand goes down almost perfectly in step.")
print("  - Correlation is often easier to interpret than covariance because it is scale-independent.")
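
The link between the two metrics is that correlation is covariance divided by the product of the two standard deviations. The sketch below checks this on the same sample data and also rescales price by 100 (a hypothetical change of units, for illustration only) to show that covariance changes with scale while correlation does not.

```python
import pandas as pd

# Same sample data as in the cell above
df = pd.DataFrame({'price': [10, 12, 15, 13, 11, 16, 14, 17, 9, 18],
                   'demand': [100, 90, 80, 85, 95, 75, 82, 70, 105, 65]})

cov = df['price'].cov(df['demand'])

# Correlation = covariance / (std of price * std of demand)
r_manual = cov / (df['price'].std() * df['demand'].std())
r_pandas = df['price'].corr(df['demand'])

# Hypothetical rescaling: express price in units 100 times smaller
price_scaled = df['price'] * 100
cov_scaled = price_scaled.cov(df['demand'])
r_scaled = price_scaled.corr(df['demand'])

print(f"Covariance: {cov:.2f}, after rescaling: {cov_scaled:.2f}")
print(f"Correlation (manual): {r_manual:.4f}")
print(f"Correlation (pandas): {r_pandas:.4f}, after rescaling: {r_scaled:.4f}")
```

The covariance grows by the same factor as the rescaling, whereas the correlation is unchanged, which is the sense in which correlation is scale-independent.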

Title: Pair Plot for Multivariate Analysis<br>

Question 8: Utilize a pair plot on a dataset to explore the relationships and distributions between
height, weight, and age. What insights can you glean from this visualization?

In [None]:
# Write your code from here
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Assuming your dataset is in a pandas DataFrame called 'df'
# and you have columns named 'height', 'weight', and 'age'

# Let's create a sample DataFrame with height (cm), weight (kg), and age (years) data
data = {'height': [165, 170, 158, 175, 180, 162, 168, 172, 155, 178,
                   163, 171, 166, 176, 182, 159, 169, 173, 157, 179,
                   167, 174, 161, 177, 181, 160, 170, 174, 164, 177],
        'weight': [65, 70, 58, 78, 82, 60, 68, 75, 55, 80,
                    62, 72, 66, 76, 85, 59, 69, 73, 57, 81,
                    67, 74, 61, 77, 83, 56, 71, 74, 63, 79],
        'age': [25, 30, 22, 35, 40, 28, 26, 32, 29, 38,
                31, 33, 27, 36, 39, 24, 37, 34, 23, 41,
                29, 35, 26, 38, 42, 25, 30, 33, 28, 37]}
df = pd.DataFrame(data)

# Generate the pair plot using seaborn
# pairplot creates its own figure, so size it with `height` instead of calling plt.figure()
sns.pairplot(df[['height', 'weight', 'age']], height=3)
plt.suptitle("Pair Plot of Height, Weight, and Age", y=1.02) # Add a title
plt.show()

# Insights that can be gleaned from this visualization (based on the sample data):
print("\nInsights from the Pair Plot (based on the sample data):")
print("- **Univariate Distributions (on the diagonal):**")
print("  - The diagonal shows the distribution of each individual feature (height, weight, age). You can observe their shapes (e.g., roughly normal, skewed, etc.). For instance, you might see if the ages are clustered around a certain value or if the heights are spread out.")
print("- **Bivariate Relationships (off the diagonal):**")
print("  - The off-diagonal plots show the scatter plots between each pair of features:")
print("    - **Height vs. Weight:** You can look for a trend. In this sample, you'd likely see a positive trend, suggesting that taller individuals tend to weigh more.")
print("    - **Height vs. Age:** You can examine if there's any relationship between height and age. In a general population, you might expect height to increase with age up to a point and then plateau.")
print("    - **Weight vs. Age:** You can explore how weight might vary with age. This relationship can be more complex and might show different patterns depending on the life stages represented in the data.")
print("- **Correlation Strength:** The tightness of the points in the scatter plots can give a visual indication of the strength of the linear correlation between the variables. Tightly clustered points suggest a stronger linear relationship than loosely scattered points.")
print("- **Potential Outliers:** You might be able to visually identify potential outliers as points that are far away from the general clusters in the scatter plots.")
print("- **Non-linear Relationships:** While pair plots primarily highlight linear trends, you might also get hints of non-linear relationships if the scatter plots show curved patterns.")

# To use this code with your data:
# 1. Ensure your data is loaded into a pandas DataFrame.
# 2. Make sure you have columns named 'height', 'weight', and 'age' (adjust the list in sns.pairplot if your column names are different).
# 3. Run the code and carefully examine the resulting pair plot to understand the relationships and distributions in your data.

Title: Principal Component Analysis (PCA)<br>

Question 9: Apply PCA on a dataset with multiple features (e.g., x1, x2, x3, x4) and reduce it to two principal components. Visualize the data in the new feature space.

In [None]:
# Write your code from here
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# 1. Generate Sample Data
np.random.seed(42)
data = {
    'x1': np.random.rand(100) * 10,
    'x2': 2 * np.random.rand(100) * 10 + np.random.normal(0, 2, 100),
    'x3': -1.5 * np.random.rand(100) * 10 + np.random.normal(0, 1, 100),
    'x4': 0.5 * np.random.rand(100) * 10 + np.random.normal(0, 3, 100)
}
df = pd.DataFrame(data)

# 2. Prepare the Data for PCA (Scaling is important)
X = df[['x1', 'x2', 'x3', 'x4']].values
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# 3. Apply PCA to Reduce to Two Principal Components
n_components = 2
pca = PCA(n_components=n_components)
principal_components = pca.fit_transform(X_scaled)

# 4. Create a New DataFrame with the Principal Components
pca_df = pd.DataFrame(data=principal_components, columns=['Principal Component 1', 'Principal Component 2'])

# 5. Visualize the Data in the New Feature Space
plt.figure(figsize=(8, 6))
sns.scatterplot(x='Principal Component 1', y='Principal Component 2', data=pca_df)
plt.title('Data Visualized in 2D PCA Space')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.grid(True)
plt.show()

# Optional: Print the Explained Variance Ratio
explained_variance_ratio = pca.explained_variance_ratio_
print(f"\nExplained Variance Ratio for 2 Principal Components: {explained_variance_ratio}")
print(f"Total Explained Variance: {sum(explained_variance_ratio):.2f}")
print("\nInterpretation:")
print("- The scatter plot shows the original data projected onto the two principal components.")
print("- Principal Component 1 captures the direction of the largest variance in the data.")
print("- Principal Component 2 captures the direction of the second largest variance, orthogonal to the first.")
print("- The explained variance ratio indicates the proportion of the dataset's variance captured by each principal component. A higher total explained variance suggests that these two components do a good job of representing the original data.")
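
To illustrate what the PCA step computes, here is a minimal NumPy sketch that standardizes the data, takes a singular value decomposition, and projects onto the first two right singular vectors. The random data and the induced correlation between two columns are assumptions made for illustration. The signs of the components may differ from scikit-learn's output, since the sign of a principal axis is arbitrary.

```python
import numpy as np

# Hypothetical data: 100 samples, 4 features, with columns 0 and 1 correlated
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
X[:, 1] += 0.8 * X[:, 0]

# Standardize each feature (zero mean, unit variance), as StandardScaler does
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

# SVD of the standardized data: rows of Vt are the principal axes
U, S, Vt = np.linalg.svd(X_scaled, full_matrices=False)

# Project onto the first two principal axes
scores = X_scaled @ Vt[:2].T

# Variance captured by each component is proportional to its squared singular value
explained_variance_ratio = S ** 2 / np.sum(S ** 2)

print("Explained variance ratio:", explained_variance_ratio.round(3))
print("Total for first two components:", explained_variance_ratio[:2].sum().round(3))
```

The projected columns are uncorrelated with each other, and the ratios are sorted from largest to smallest, which matches the interpretation given above.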

Title: Advanced Pair Plot with Hue Parameter<br>

Question 10: Create a pair plot for height, weight, and age with an added categorical variable gender as the hue to observe different group trends.

In [None]:
# Write your code from here
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# 1. Generate Sample Data with Gender
np.random.seed(42)
n_samples = 100
data = {
    'height': np.concatenate([np.random.normal(170, 10, n_samples // 2), np.random.normal(160, 8, n_samples // 2)]),
    'weight': np.concatenate([np.random.normal(70, 15, n_samples // 2), np.random.normal(55, 10, n_samples // 2)]),
    'age': np.random.randint(20, 50, n_samples),
    'gender': ['Male'] * (n_samples // 2) + ['Female'] * (n_samples // 2)
}
df = pd.DataFrame(data)

# 2. Generate the Pair Plot with Hue
# pairplot creates its own figure, so size it with `height` instead of calling plt.figure()
sns.pairplot(df, hue='gender', diag_kind='kde', markers=['o', 's'], height=3.5)
plt.suptitle("Pair Plot of Height, Weight, and Age by Gender", y=1.02)
plt.show()

# Insights that can be gleaned from this visualization (based on the sample data):
print("\nInsights from the Pair Plot with Hue (based on the sample data):")
print("- **Univariate Distributions by Gender (on the diagonal):**")
print("  - The diagonal now shows the Kernel Density Estimate (KDE) for each feature, separated by gender. This allows you to compare the distribution of height, weight, and age between males and females in your sample.")
print("- **Bivariate Relationships by Gender (off the diagonal):**")
print("  - The scatter plots for each pair of features are now colored by gender (e.g., 'Male' might be one color and 'Female' another). This helps you observe if the relationship between two variables differs for each gender.")
print("    - **Height vs. Weight by Gender:** You can see if the positive trend between height and weight is similar for both genders or if there are noticeable differences in the slope or spread.")
print("    - **Height vs. Age by Gender:** You can examine if the relationship between height and age varies between males and females (e.g., if males tend to be taller at certain ages or if the age at which height plateaus differs).")
print("    - **Weight vs. Age by Gender:** You can explore how weight changes with age for each gender. This might reveal different patterns related to life stages, metabolism, etc.")
print("- **Group Separation:** The pair plot can visually indicate if there are clear separations or overlaps between the gender groups in the feature space. For example, if males tend to have higher heights and weights, you might see some separation in those scatter plots.")
print("- **Different Correlation Strengths:** You might observe that the correlation between two variables is stronger for one gender compared to the other (e.g., the points might be more tightly clustered for males than females in a specific scatter plot).")

# To use this code with your data:
# 1. Ensure your data is loaded into a pandas DataFrame.
# 2. Make sure you have columns named 'height', 'weight', 'age', and a categorical column representing 'gender' (or your grouping variable).
# 3. Replace 'gender' in the `hue` parameter of `sns.pairplot()` with the actual name of your categorical column.
# 4. Run the code and carefully examine the resulting pair plot to understand the relationships and distributions within each group.
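
The group differences that the hue coloring makes visible can be summarized numerically as well. This sketch regenerates the same seeded sample and uses `groupby` to compare per-gender means and standard deviations, plus the height-weight correlation within each group. It is offered as one possible complement to the plot.

```python
import numpy as np
import pandas as pd

# Regenerate the same seeded sample as in the cell above
np.random.seed(42)
n_samples = 100
df = pd.DataFrame({
    'height': np.concatenate([np.random.normal(170, 10, n_samples // 2),
                              np.random.normal(160, 8, n_samples // 2)]),
    'weight': np.concatenate([np.random.normal(70, 15, n_samples // 2),
                              np.random.normal(55, 10, n_samples // 2)]),
    'age': np.random.randint(20, 50, n_samples),
    'gender': ['Male'] * (n_samples // 2) + ['Female'] * (n_samples // 2),
})

# Per-group mean and standard deviation for each numeric feature
summary = df.groupby('gender')[['height', 'weight', 'age']].agg(['mean', 'std'])
print(summary.round(1))

# Height-weight correlation computed separately within each group
hw_corr = {name: group['height'].corr(group['weight'])
           for name, group in df.groupby('gender')}
print({name: round(value, 2) for name, value in hw_corr.items()})
```

Because this sample draws height and weight independently within each group, the within-group correlations should be close to zero; any trend in the pooled scatter plot comes from the offset between the two groups rather than from a relationship inside either one.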