In exploratory data analysis (EDA), data scientists use a range of techniques to understand the structure, patterns, and relationships within a dataset. Here are some additional activities commonly performed during EDA:


Data Summary Statistics:

Compute descriptive statistics such as mean, median, mode, standard deviation, range, and quartiles for numerical variables.
Summarize categorical variables with frequency counts or proportions of each category.


Data Visualization:

Create visual representations of the data using charts, graphs, and plots to reveal patterns and relationships.
Common visualization techniques include histograms, box plots, scatter plots, bar charts, line plots, and heatmaps.


Data Distribution Analysis:

Examine the distribution of numerical variables to understand their spread, skewness, and presence of outliers.
Use probability density plots (e.g., kernel density estimation) to visualize the distribution of continuous variables.


Correlation Analysis:

Compute correlation coefficients (e.g., Pearson correlation, Spearman rank correlation) to quantify the strength and direction of relationships between pairs of variables.
Visualize correlations using correlation matrices or heatmaps to identify clusters of correlated variables.
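The difference between the two coefficients can be sketched on synthetic data. The frame `demo` and its column names are made up for illustration. Pearson measures linear association, while Spearman works on ranks and so captures any monotonic relationship.

```python
import numpy as np
import pandas as pd

# Hypothetical data: y_monotonic is a strictly increasing but nonlinear function of x
rng = np.random.default_rng(0)
x = rng.uniform(0, 5, size=200)
demo = pd.DataFrame({
    "x": x,
    "y_monotonic": np.exp(x),
    "noise": rng.normal(size=200),  # unrelated to x
})

# Full correlation matrices under each method
pearson = demo.corr(method="pearson")
spearman = demo.corr(method="spearman")

print("Pearson:\n", pearson.round(2))
print("Spearman:\n", spearman.round(2))
```

A large gap between the two values for the same pair of variables is a hint that the relationship is monotonic but not linear, which is worth checking with a scatter plot.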


Dimensionality Reduction:

Apply dimensionality reduction techniques such as Principal Component Analysis (PCA) or t-distributed Stochastic Neighbor Embedding (t-SNE) to visualize high-dimensional data in a lower-dimensional space.
For PCA, inspect the explained variance ratio to see how much of the total variance each principal component captures.
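As a rough illustration of the explained variance ratio, the sketch below computes PCA with a plain NumPy SVD instead of a library class, on made-up data generated from two latent factors. scikit-learn's `PCA` exposes the same quantity as `explained_variance_ratio_`. All variable names here are assumptions.

```python
import numpy as np

# Hypothetical data: 300 samples, 5 features driven by 2 latent factors plus small noise
rng = np.random.default_rng(42)
latent = rng.normal(size=(300, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.05 * rng.normal(size=(300, 5))

# PCA via SVD of the centered data matrix
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Variance captured by each component, as a fraction of the total
explained_variance = S**2 / (X.shape[0] - 1)
explained_variance_ratio = explained_variance / explained_variance.sum()

# Project onto the first two principal components for plotting
X_2d = X_centered @ Vt[:2].T

print("Explained variance ratio:", explained_variance_ratio.round(3))
print("Cumulative:", np.cumsum(explained_variance_ratio).round(3))
```

The cumulative ratio is a common way to decide how many components to keep. Features on very different scales are usually standardized before this step.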


Outlier Detection:

Identify outliers or anomalies in the data that deviate significantly from the rest of the observations.
Use visualization tools like scatter plots or box plots, or statistical methods like Z-score or interquartile range (IQR) to detect outliers.
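The IQR rule can be sketched as follows. The helper `iqr_outlier_mask` and the sample values are hypothetical. The multiplier `k=1.5` is the conventional box-plot whisker length, not something the dataset dictates.

```python
import pandas as pd

def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True for points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Hypothetical sample with one extreme value
values = pd.Series([10, 12, 11, 13, 12, 11, 10, 12, 95], name="numerical_column")

iqr_mask = iqr_outlier_mask(values)

# Z-score rule with the usual threshold of 3, for comparison
z_scores = (values - values.mean()) / values.std()
z_mask = z_scores.abs() > 3

print("IQR outliers:", values[iqr_mask].tolist())
print("Z-score outliers:", values[z_mask].tolist())
```

On this small sample the IQR rule flags 95 while the Z-score rule does not, because the extreme value inflates the standard deviation it is measured against. This is one reason to compare both methods rather than rely on either alone.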



Missing Values Analysis:

Investigate the presence and patterns of missing values in the dataset.
Assess the impact of missing data on analysis and consider strategies for handling missing values (e.g., imputation, deletion).
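A minimal sketch of these checks on a small made-up frame is given below. The column names and values are illustrative only, and median imputation is just one of several possible strategies.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in several columns
demo = pd.DataFrame({
    "age": [25, np.nan, 31, 47, np.nan, 52],
    "income": [40_000, 52_000, np.nan, 88_000, np.nan, 61_000],
    "city": ["Oslo", "Lima", None, "Pune", "Lima", "Oslo"],
})

# Presence: count and percentage of missing values per column
missing_summary = pd.DataFrame({
    "n_missing": demo.isna().sum(),
    "pct_missing": demo.isna().mean().mul(100).round(1),
})
print(missing_summary)

# Pattern: how many rows have 0, 1, 2, ... missing fields
rows_by_missing_count = demo.isna().sum(axis=1).value_counts().sort_index()
print(rows_by_missing_count)

# Strategy 1: deletion (drops every row with at least one missing value)
dropped = demo.dropna()

# Strategy 2: imputation (numeric columns filled with their median)
imputed = demo.assign(
    age=demo["age"].fillna(demo["age"].median()),
    income=demo["income"].fillna(demo["income"].median()),
)

print("Rows before:", len(demo), "after deletion:", len(dropped))
```

Comparing the row count before and after deletion gives a quick sense of how costly that strategy is. Imputation keeps every row but changes the distribution of the filled columns.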


Feature Importance Analysis:

Assess the importance of features in predicting the target variable using techniques like feature importance scores from tree-based models or permutation feature importance.
Visualize feature importance rankings to prioritize features for further analysis or modeling.
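Permutation feature importance can be sketched with scikit-learn's `permutation_importance`. The dataset here is synthetic (from `make_classification`) and the feature names are placeholders. With `shuffle=False`, the two informative features are the first two columns.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: features 0 and 1 are informative, the rest are noise
X_arr, y = make_classification(
    n_samples=500, n_features=6, n_informative=2, n_redundant=0,
    shuffle=False, random_state=0,
)
feature_names = [f"feature_{i}" for i in range(6)]
X = pd.DataFrame(X_arr, columns=feature_names)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)

# Shuffle each feature on held-out data and measure the drop in accuracy
result = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=0)
perm_importance = pd.Series(
    result.importances_mean, index=feature_names
).sort_values(ascending=False)

print(perm_importance.round(3))
```

Because the scores are computed on held-out data, they reflect how much the model's predictions actually depend on each feature, which can differ from the impurity-based scores in `feature_importances_`.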


Time-Series Analysis (if applicable):

Decompose time-series data into trend, seasonal, and residual components to identify patterns and seasonal effects.
Plot time-series data and perform autocorrelation analysis to detect temporal dependencies and seasonality.
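The decomposition and autocorrelation steps can be approximated with pandas alone. This is a simplified sketch on a synthetic monthly series, assuming a 12-month seasonal period. It uses a plain centered moving average for the trend rather than a dedicated decomposition routine.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series: linear trend + yearly seasonality + noise
rng = np.random.default_rng(1)
dates = pd.date_range("2018-01-01", periods=72, freq="MS")
t = np.arange(72)
series = pd.Series(
    0.5 * t + 10 * np.sin(2 * np.pi * t / 12) + rng.normal(scale=1.0, size=72),
    index=dates,
    name="value_column",
)

# Trend: centered 12-month moving average (NaN at both ends of the series)
trend = series.rolling(window=12, center=True).mean()
detrended = series - trend

# Seasonal component: average detrended value for each calendar month
seasonal = detrended.groupby(detrended.index.month).transform("mean")
residual = detrended - seasonal

# Autocorrelation of the detrended series at half and full seasonal lags
detrended_valid = detrended.dropna()
acf_6 = detrended_valid.autocorr(lag=6)
acf_12 = detrended_valid.autocorr(lag=12)

print("Autocorrelation at lag 6:", round(acf_6, 2))
print("Autocorrelation at lag 12:", round(acf_12, 2))
```

A strong positive autocorrelation at the seasonal lag, paired with a strong negative one at half that lag, is the typical signature of a yearly cycle in monthly data.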


Hypothesis Testing (if applicable):

Formulate hypotheses about relationships or differences between groups in the data.
Conduct statistical tests (e.g., t-tests, ANOVA) to assess the significance of observed differences and evaluate those hypotheses.


By conducting these additional activities during EDA, data scientists can uncover deeper insights into the dataset, identify potential challenges or biases, and inform subsequent steps in the data analysis process, including feature engineering, modeling, and interpretation.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset
df = pd.read_csv('dataset.csv')

# Data Summary Statistics
print("Descriptive Statistics for Numerical Variables:")
print(df.describe())

print("\nFrequency Counts for Categorical Variables:")
print(df['category_column'].value_counts())

# Data Visualization
# Histograms and Density Plots
plt.figure(figsize=(10, 6))
sns.histplot(df['numerical_column'], kde=True)
plt.title('Histogram of Numerical Column')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()

# Box Plot
plt.figure(figsize=(10, 6))
sns.boxplot(x='category_column', y='numerical_column', data=df)
plt.title('Box Plot of Numerical Column by Category')
plt.xlabel('Category')
plt.ylabel('Numerical Column')
plt.show()

# Scatter Plot
plt.figure(figsize=(10, 6))
sns.scatterplot(x='numerical_column1', y='numerical_column2', data=df)
plt.title('Scatter Plot of Numerical Column1 vs Numerical Column2')
plt.xlabel('Numerical Column1')
plt.ylabel('Numerical Column2')
plt.show()

# Bar Plot
plt.figure(figsize=(10, 6))
sns.countplot(x='category_column', data=df)
plt.title('Bar Plot of Category Column')
plt.xlabel('Category')
plt.ylabel('Count')
plt.show()

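# Heatmap
# numeric_only=True skips non-numeric columns, which would otherwise raise an error in pandas >= 2.0
plt.figure(figsize=(10, 6))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')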
plt.title('Correlation Heatmap')
plt.show()

# Data Distribution Analysis
print("Skewness:", df['numerical_column'].skew())
print("Kurtosis:", df['numerical_column'].kurt())

# Outlier Detection
# Z-score method
outliers_zscore = df[np.abs(df['numerical_column'] - df['numerical_column'].mean()) > (3 * df['numerical_column'].std())]

# Visualize outliers
plt.figure(figsize=(10, 6))
sns.boxplot(x='category_column', y='numerical_column', data=df)
sns.scatterplot(x='category_column', y='numerical_column', data=outliers_zscore, color='red', marker='o', label='Outliers')
plt.title('Box Plot with Outliers Detected by Z-score Method')
plt.xlabel('Category')
plt.ylabel('Numerical Column')
plt.legend()
plt.show()

# Missing Values Analysis
print("Missing Values Summary:")
print(df.isnull().sum())

# Feature Importance Analysis (using Random Forest as an example)
from sklearn.ensemble import RandomForestClassifier

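# Separate the target, then keep only the numeric feature columns
# (assumes the numeric features contain no missing values)
X = df.drop(columns=['target_column']).select_dtypes(include='number')
y = df['target_column']

# Train Random Forest model (fixed seed so the importances are reproducible)
rf = RandomForestClassifier(random_state=42)
rf.fit(X, y)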

# Feature importance
feature_importance = pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False)

# Visualize feature importance
plt.figure(figsize=(10, 6))
sns.barplot(x=feature_importance, y=feature_importance.index)
plt.title('Feature Importance')
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.show()

# Time-Series Analysis (if applicable)
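# Load time-series data, parsing the date column so the x-axis is treated as dates, not strings
time_series_df = pd.read_csv('time_series_data.csv', parse_dates=['date_column'])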

# Plot time-series data
plt.figure(figsize=(10, 6))
plt.plot(time_series_df['date_column'], time_series_df['value_column'])
plt.title('Time-Series Plot')
plt.xlabel('Date')
plt.ylabel('Value')
plt.show()

# Hypothesis Testing (if applicable)
from scipy.stats import ttest_ind

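# Example of a two-sample t-test
# Missing values are dropped first; otherwise ttest_ind returns NaN
group1 = df.loc[df['group_column'] == 'Group1', 'numerical_column'].dropna()
group2 = df.loc[df['group_column'] == 'Group2', 'numerical_column'].dropna()

t_stat, p_value = ttest_ind(group1, group2)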

print("T-statistic:", t_stat)
print("P-value:", p_value)




