Question 1: Understanding the Dataset 
<br>
Description: Load a dataset and understand its basic properties, including data types, dimensions, and the first few rows.

In [21]:
import pandas as pd

# Load a dataset (replace 'your_dataset.csv' with the actual path to your file)
# For demonstration, let's create a simple DataFrame
data = {'col1': [1, 2, 3, 4, 5],
        'col2': ['a', 'b', 'c', 'd', 'e'],
        'col3': [1.1, 2.2, 3.3, 4.4, 5.5]}
df = pd.DataFrame(data)

# Understand basic properties

# Data types of each column
print("Data Types:")
print(df.dtypes)
print("\n")

# Dimensions of the DataFrame (number of rows and columns)
print("Dimensions:")
print(df.shape)
print("\n")

# Number of rows
print("Number of Rows:")
print(len(df))
print("\n")

# Number of columns
print("Number of Columns:")
print(len(df.columns))
print("\n")

# Index of the DataFrame (row labels)
print("Index:")
print(df.index)
print("\n")

# Columns of the DataFrame (column labels)
print("Columns:")
print(df.columns)
print("\n")

# First few rows of the DataFrame
print("First 5 Rows:")
print(df.head())
print("\n")

# First n rows (e.g., first 3 rows)
n = 3
print(f"First {n} Rows:")
print(df.head(n))
print("\n")

# Concise summary of the DataFrame, including data types and non-null values
print("DataFrame Info:")
df.info()


Data Types:
col1      int64
col2     object
col3    float64
dtype: object


Dimensions:
(5, 3)


Number of Rows:
5


Number of Columns:
3


Index:
RangeIndex(start=0, stop=5, step=1)


Columns:
Index(['col1', 'col2', 'col3'], dtype='object')


First 5 Rows:
   col1 col2  col3
0     1    a   1.1
1     2    b   2.2
2     3    c   3.3
3     4    d   4.4
4     5    e   5.5


First 3 Rows:
   col1 col2  col3
0     1    a   1.1
1     2    b   2.2
2     3    c   3.3


DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   col1    5 non-null      int64  
 1   col2    5 non-null      object 
 2   col3    5 non-null      float64
dtypes: float64(1), int64(1), object(1)
memory usage: 248.0+ bytes


Question 2: Checking for Missing Values
<br>
Description: Identify missing values in the dataset.

In [22]:
import pandas as pd
import numpy as np

# Load a dataset (replace 'your_dataset.csv' with the actual path to your file)
# For demonstration, let's create a DataFrame with some missing values
data = {'col1': [1, 2, np.nan, 4, 5],
        'col2': ['a', np.nan, 'c', 'd', 'e'],
        'col3': [1.1, 2.2, 3.3, np.nan, 5.5],
        'col4': [True, False, True, True, np.nan]}
df = pd.DataFrame(data)

# Identify missing values

# 1. Display the DataFrame, then flag each missing (NaN) cell with a Boolean mask
print("DataFrame with Missing Values:")
print(df)
print("\n")

print("Boolean DataFrame indicating missing values (True if missing, False otherwise):")
print(df.isnull())
print("\n")

# 2. Count the number of missing values in each column
print("Number of missing values per column:")
print(df.isnull().sum())
print("\n")

# 3. Count the total number of missing values in the entire DataFrame
total_missing = df.isnull().sum().sum()
print(f"Total number of missing values in the DataFrame: {total_missing}")
print("\n")

# 4. Get the percentage of missing values in each column
missing_percentage = (df.isnull().sum() / len(df)) * 100
print("Percentage of missing values per column:")
print(missing_percentage)
print("\n")

# 5. Check if there are any missing values in the entire DataFrame (returns True if any missing value exists)
has_missing = df.isnull().any().any()
print(f"Are there any missing values in the DataFrame? {has_missing}")

DataFrame with Missing Values:
   col1 col2  col3   col4
0   1.0    a   1.1   True
1   2.0  NaN   2.2  False
2   NaN    c   3.3   True
3   4.0    d   NaN   True
4   5.0    e   5.5    NaN


Boolean DataFrame indicating missing values (True if missing, False otherwise):
    col1   col2   col3   col4
0  False  False  False  False
1  False   True  False  False
2   True  False  False  False
3  False  False   True  False
4  False  False  False   True


Number of missing values per column:
col1    1
col2    1
col3    1
col4    1
dtype: int64


Total number of missing values in the DataFrame: 4


Percentage of missing values per column:
col1    20.0
col2    20.0
col3    20.0
col4    20.0
dtype: float64


Are there any missing values in the DataFrame? True


Question 3: Descriptive Statistics
<br>
Description: Calculate descriptive statistics for numerical columns.

In [23]:
import pandas as pd
import numpy as np

# Load a dataset (replace 'your_dataset.csv' with the actual path to your file)
# For demonstration, let's create a DataFrame with numerical and non-numerical columns
data = {'col1': [1, 2, 3, 4, 5],
        'col2': ['a', 'b', 'c', 'd', 'e'],
        'col3': [1.1, 2.2, 3.3, 4.4, 5.5],
        'col4': [10, 20, 30, 40, 50],
        'col5': [np.nan, 2, 3, np.nan, 5]}
df = pd.DataFrame(data)

# Calculate descriptive statistics for numerical columns

# 1. Get descriptive statistics for all numerical columns
print("Descriptive Statistics for all numerical columns:")
print(df.describe())
print("\n")

# 2. Get descriptive statistics for a specific numerical column
column_name = 'col3'
print(f"Descriptive Statistics for column '{column_name}':")
print(df[column_name].describe())
print("\n")

# 3. Calculate specific descriptive statistics for numerical columns

# Mean of all numerical columns
print("Mean of numerical columns:")
print(df.mean(numeric_only=True))
print("\n")

# Mean of a specific numerical column
print(f"Mean of column '{column_name}': {df[column_name].mean()}")
print("\n")

# Median of all numerical columns
print("Median of numerical columns:")
print(df.median(numeric_only=True))
print("\n")

# Standard deviation of all numerical columns
print("Standard deviation of numerical columns:")
print(df.std(numeric_only=True))
print("\n")

# Minimum value of all numerical columns
print("Minimum value of numerical columns:")
print(df.min(numeric_only=True))
print("\n")

# Maximum value of all numerical columns
print("Maximum value of numerical columns:")
print(df.max(numeric_only=True))
print("\n")

# Count of non-missing values in each column
print("Count of non-missing values in each column:")
print(df.count())
print("\n")

# Quantiles of all numerical columns (e.g., 25th, 50th, 75th percentiles)
print("Quantiles of numerical columns:")
print(df.quantile([0.25, 0.5, 0.75], numeric_only=True))
print("\n")

Descriptive Statistics for all numerical columns:
           col1      col3       col4      col5
count  5.000000  5.000000   5.000000  3.000000
mean   3.000000  3.300000  30.000000  3.333333
std    1.581139  1.739253  15.811388  1.527525
min    1.000000  1.100000  10.000000  2.000000
25%    2.000000  2.200000  20.000000  2.500000
50%    3.000000  3.300000  30.000000  3.000000
75%    4.000000  4.400000  40.000000  4.000000
max    5.000000  5.500000  50.000000  5.000000


Descriptive Statistics for column 'col3':
count    5.000000
mean     3.300000
std      1.739253
min      1.100000
25%      2.200000
50%      3.300000
75%      4.400000
max      5.500000
Name: col3, dtype: float64


Mean of numerical columns:
col1     3.000000
col3     3.300000
col4    30.000000
col5     3.333333
dtype: float64


Mean of column 'col3': 3.3


Median of numerical columns:
col1     3.0
col3     3.3
col4    30.0
col5     3.0
dtype: float64


Standard deviation of numerical columns:
col1     1.581139
col3     1.739253
col4    15.811388
col5     1.527525
dtype: float64

Question 4: Handling Outliers
<br>
Description: Identify outliers in numerical columns using box plots.

In [24]:
# Write your code from here
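
One possible answer, offered as a minimal sketch rather than the intended solution: draw one box plot per numerical column, then list the points that fall outside the whiskers using the 1.5 * IQR rule (matplotlib's default whisker length). The demo data, the column names, and the helper `iqr_outliers` are hypothetical and exist only for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical demo data; 'col2' contains one deliberately extreme value (100)
df = pd.DataFrame({'col1': [1, 2, 3, 4, 5, 6, 7, 8],
                   'col2': [10, 12, 11, 13, 12, 11, 100, 12]})

numeric_cols = df.select_dtypes(include='number').columns

# One box plot per numerical column; squeeze=False keeps axes two-dimensional
fig, axes = plt.subplots(1, len(numeric_cols),
                         figsize=(4 * len(numeric_cols), 4), squeeze=False)
for ax, col in zip(axes[0], numeric_cols):
    ax.boxplot(df[col].dropna())
    ax.set_title(col)
plt.tight_layout()
plt.show()


def iqr_outliers(series):
    """Return values lying more than 1.5 * IQR beyond the quartiles."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    mask = (series < q1 - 1.5 * iqr) | (series > q3 + 1.5 * iqr)
    return series[mask]


# Outliers per column, using the same rule the box plot whiskers follow
outliers = {col: iqr_outliers(df[col]) for col in numeric_cols}
for col, values in outliers.items():
    print(f"Outliers in '{col}': {values.tolist()}")
```

Points drawn as individual markers beyond the whiskers in the plot should match the values printed by the loop.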


Question 5: Categorical Data Analysis
<br>
Description: Explore the counts of categorical variables.

In [25]:
# Write your code from here
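
A minimal sketch of one way to approach this, assuming a small hypothetical DataFrame with two categorical columns (`color` and `size` are made-up names): `value_counts()` gives the frequency of each category, `normalize=True` turns the counts into proportions, and `pd.crosstab` counts the combinations of two columns.

```python
import pandas as pd

# Hypothetical demo data with two categorical columns
df = pd.DataFrame({'color': ['red', 'blue', 'red', 'green', 'blue', 'red'],
                   'size': ['S', 'M', 'M', 'L', 'S', 'M']})

categorical_cols = ['color', 'size']

# Frequency of each category, most common first
for col in categorical_cols:
    print(f"Counts for '{col}':")
    print(df[col].value_counts())
    print("\n")

# Proportions instead of raw counts
color_counts = df['color'].value_counts()
color_share = df['color'].value_counts(normalize=True)
print("Share of each color:")
print(color_share)
print("\n")

# Number of distinct categories per column
print("Unique categories per column:")
print(df[categorical_cols].nunique())
print("\n")

# Counts for every combination of the two columns
combo_counts = pd.crosstab(df['color'], df['size'])
print("Cross-tabulation of color by size:")
print(combo_counts)
```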

Question 6: Data Transformation
<br>
Description: Transform a categorical column into numerical using Label Encoding.

In [26]:
# Write your code from here
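
A minimal sketch, under the assumption that integer codes assigned in sorted category order are acceptable. It uses pandas category codes rather than scikit-learn's `LabelEncoder`; the latter also numbers classes in sorted order, so it should produce the same codes. The `city` column and its values are hypothetical.

```python
import pandas as pd

# Hypothetical demo data with one categorical column
df = pd.DataFrame({'city': ['Paris', 'London', 'Tokyo', 'London', 'Paris']})

# Convert to a categorical dtype; categories are stored in sorted order
city_cat = df['city'].astype('category')

# Each category is replaced by its integer position (0, 1, 2, ...)
df['city_encoded'] = city_cat.cat.codes

# Keep the code -> label mapping so the encoding can be reversed later
mapping = dict(enumerate(city_cat.cat.categories))

print(df)
print("\n")
print("Code to label mapping:")
print(mapping)
```

Label encoding implies an ordering (here London < Paris < Tokyo) that the data may not have, which matters for models that treat the codes as magnitudes.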

Question 7: Visualizing Data Distributions
<br>
Description: Plot histograms for numerical columns to understand distributions.

In [27]:
# Write your code from here
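
One way this could look, as a sketch on synthetic data: `DataFrame.hist()` draws one histogram per numerical column, and `skew()` gives a rough numeric check of asymmetry. The column names, sample sizes, and the choice of 20 bins are assumptions made for the demonstration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical demo data: one roughly symmetric and one right-skewed column
rng = np.random.default_rng(42)
df = pd.DataFrame({'normal': rng.normal(50, 10, 500),
                   'skewed': rng.exponential(5, 500),
                   'label': ['a', 'b'] * 250})

# Keep only the numerical columns; 'label' is dropped
numeric_df = df.select_dtypes(include='number')

# One histogram per numerical column
numeric_df.hist(bins=20, figsize=(10, 4), edgecolor='black')
plt.tight_layout()
plt.show()

# Skewness near 0 suggests symmetry; clearly positive values suggest a right tail
skewness = numeric_df.skew()
print("Skewness of numerical columns:")
print(skewness)
```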

Question 8: Correlation Analysis
<br>
Description: Calculate and visualize the correlation matrix for numerical features.

In [28]:
# Write your code from here
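
A minimal sketch, assuming Pearson correlation (the default of `DataFrame.corr`) is what is wanted. The heatmap is drawn with plain matplotlib; `seaborn.heatmap` is a common alternative. The columns `x`, `y`, `z`, and `w` are hypothetical.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical demo data: y rises with x, z falls with x, w is loosely related
df = pd.DataFrame({'x': [1, 2, 3, 4, 5, 6],
                   'y': [2, 4, 6, 8, 10, 12],
                   'z': [6, 5, 4, 3, 2, 1],
                   'w': [3, 1, 4, 1, 5, 9]})

# Pairwise Pearson correlation between numerical columns
corr = df.corr(numeric_only=True)
print("Correlation matrix:")
print(corr)

# Heatmap with the colour scale fixed to the full [-1, 1] range
fig, ax = plt.subplots(figsize=(5, 4))
im = ax.imshow(corr.to_numpy(), cmap='coolwarm', vmin=-1, vmax=1)
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns)
ax.set_yticks(range(len(corr.index)))
ax.set_yticklabels(corr.index)

# Write each coefficient inside its cell
for i in range(len(corr.index)):
    for j in range(len(corr.columns)):
        ax.text(j, i, f"{corr.iloc[i, j]:.2f}", ha='center', va='center')

fig.colorbar(im, ax=ax, label='Pearson correlation')
plt.tight_layout()
plt.show()
```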

Question 9: Feature Engineering
<br>
Description: Create a new feature by combining or transforming existing features.

In [29]:
# Write your code from here
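
A sketch of three common patterns, using a hypothetical DataFrame with `price` and `quantity` columns: combining two features into one, applying a mathematical transform, and binning a numerical feature into categories. The bin edges and labels are arbitrary choices for the demonstration.

```python
import numpy as np
import pandas as pd

# Hypothetical demo data
df = pd.DataFrame({'price': [10.0, 20.0, 5.0],
                   'quantity': [3, 1, 10]})

# 1. Combine two existing features into a new one
df['total'] = df['price'] * df['quantity']

# 2. Transform an existing feature; log1p computes log(1 + x)
df['log_price'] = np.log1p(df['price'])

# 3. Bin a numerical feature; intervals are (0, 10] and (10, 50]
df['price_band'] = pd.cut(df['price'], bins=[0, 10, 50], labels=['low', 'high'])

print(df)
```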

Question 10: Advanced Outlier Detection
<br>
Description: Use the Z-score method to identify and handle outliers.

In [30]:
# Write your code from here
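
A minimal sketch, assuming the usual convention of flagging values with |z| > 3, where z = (x - mean) / std. The data, the `value` column name, and the threshold are assumptions; pandas' `std()` uses the sample standard deviation (`ddof=1`) by default. Two ways of handling the flagged rows are shown: dropping them and capping them.

```python
import pandas as pd

# Hypothetical demo data: 19 values near 50 and one extreme value (200)
df = pd.DataFrame({'value': [50, 52, 48, 51, 49, 50, 53, 47, 50, 51,
                             49, 52, 48, 50, 51, 49, 50, 52, 48, 200]})

threshold = 3  # common rule of thumb, not a universal constant

# Z-score: distance from the mean in units of standard deviation
mean = df['value'].mean()
std = df['value'].std()
z_scores = (df['value'] - mean) / std

# Identify outliers
is_outlier = z_scores.abs() > threshold
outliers = df.loc[is_outlier, 'value']
print("Outliers:")
print(outliers)
print("\n")

# Option A: drop the outlier rows
df_clean = df[~is_outlier]
print(f"Rows before: {len(df)}, rows after dropping outliers: {len(df_clean)}")

# Option B: cap values at mean +/- threshold * std instead of dropping them
df_capped = df.copy()
df_capped['value'] = df['value'].clip(lower=mean - threshold * std,
                                      upper=mean + threshold * std)
print(f"Maximum after capping: {df_capped['value'].max():.2f}")
```

In very small samples a single point cannot reach a large z-score, because it inflates the mean and standard deviation it is compared against, so the IQR rule may be a better fit there.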