A collection of data science projects focusing on data exploration, visualization, and machine learning using Python.
- Python
- Pandas
- Matplotlib
- Seaborn
- Scikit-learn
- Jupyter Notebook
Explore and understand a real-world dataset through data inspection and visual analysis.
The Iris dataset contains measurements of 150 flowers from 3 different species — Setosa, Versicolor, and Virginica. For each flower, 4 measurements were recorded: sepal length, sepal width, petal length, and petal width.
- Loaded the dataset using Seaborn's built-in dataset loader
- Inspected the structure using shape, columns, and head()
- Generated summary statistics using describe()
- Created visualizations to understand patterns in the data
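The inspection steps above can be sketched roughly as follows. The notebook loads the data with Seaborn's built-in loader, which fetches the file over the network; this sketch swaps in scikit-learn's bundled copy of the same dataset so that it runs offline. The short column names are an assumption chosen to resemble Seaborn's naming.

```python
from sklearn.datasets import load_iris

# Bundled Iris data as a DataFrame (stand-in for Seaborn's dataset loader).
raw = load_iris(as_frame=True)
iris = raw.frame.rename(columns={
    "sepal length (cm)": "sepal_length",
    "sepal width (cm)": "sepal_width",
    "petal length (cm)": "petal_length",
    "petal width (cm)": "petal_width",
})
# Replace the integer target with readable species names.
iris["species"] = iris.pop("target").map(dict(enumerate(raw.target_names)))

print(iris.shape)             # rows and columns
print(iris.columns.tolist())  # column names
print(iris.head())            # first five rows
print(iris.describe())        # summary statistics for the numeric columns
```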
- Scatter Plot — shows the relationship between sepal length and petal length across species
- Histogram — shows the distribution of petal length across all flowers
- Box Plot — shows the spread and outliers of sepal width for each species
- Setosa flowers are clearly separable from the other two species based on petal measurements
- Petal length shows a bimodal distribution indicating two natural groups in the data
- Setosa has noticeably wider sepals than Versicolor and Virginica
- The scatter plot shows a strong positive correlation between sepal length and petal length
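These observations can also be checked numerically rather than read off the plots. The sketch below is one possible way to do that, again assuming scikit-learn's bundled Iris copy and assumed column names.

```python
from sklearn.datasets import load_iris

raw = load_iris(as_frame=True)
iris = raw.frame.copy()
iris.columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "target"]
iris["species"] = iris["target"].map(dict(enumerate(raw.target_names)))

# Pearson correlation between sepal length and petal length.
corr = iris["sepal_length"].corr(iris["petal_length"])

# Mean sepal width per species.
mean_sepal_width = iris.groupby("species")["sepal_width"].mean()

# Petal length range per species; non-overlapping ranges indicate separability.
petal_range = iris.groupby("species")["petal_length"].agg(["min", "max"])

print(f"correlation: {corr:.2f}")
print(mean_sepal_width)
print(petal_range)
```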
Predict whether a loan application is likely to be approved or rejected, as a proxy for credit risk, using machine learning classification.
The Loan Prediction Dataset contains information about 614 loan applicants from an Indian bank including their income, education, employment status, credit history, loan amount, and whether their loan was approved or rejected.
- Loaded and inspected both training and testing CSV files
- Identified and handled missing values using mode for categorical columns and median for numerical columns
- Converted all categorical text columns to numerical values for model compatibility
- Visualized key features including loan amount distribution, education vs loan status, and applicant income
- Trained a Logistic Regression model on 80% of the data
- Evaluated the model on the remaining 20%
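The preprocessing and training steps above might look roughly like this. Because the CSV files are not part of this README, the sketch builds a small synthetic table; the column names, the category-to-number mappings, and the random seeds are all assumptions, and the accuracy it prints says nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the training CSV (hypothetical columns and values).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "Education": rng.choice(["Graduate", "Not Graduate"], size=n),
    "Self_Employed": rng.choice(["Yes", "No"], size=n),
    "ApplicantIncome": rng.integers(1500, 20000, size=n).astype(float),
    "LoanAmount": rng.integers(50, 400, size=n).astype(float),
    "Credit_History": rng.choice([1.0, 0.0], size=n, p=[0.8, 0.2]),
})
# Toy target: mostly follows credit history, with 15% of labels flipped.
approved = (df["Credit_History"] == 1.0) ^ (rng.random(n) < 0.15)
df["Loan_Status"] = np.where(approved, "Y", "N")
# Blank out a few cells to mimic missing values.
df.loc[rng.choice(n, 15, replace=False), "Self_Employed"] = np.nan
df.loc[rng.choice(n, 15, replace=False), "LoanAmount"] = np.nan

# Impute: mode for categorical columns, median for numerical columns.
for col in ["Education", "Self_Employed"]:
    df[col] = df[col].fillna(df[col].mode()[0])
for col in ["ApplicantIncome", "LoanAmount", "Credit_History"]:
    df[col] = df[col].fillna(df[col].median())

# Convert categorical text columns to numbers.
df["Education"] = df["Education"].map({"Graduate": 1, "Not Graduate": 0})
df["Self_Employed"] = df["Self_Employed"].map({"Yes": 1, "No": 0})
df["Loan_Status"] = df["Loan_Status"].map({"Y": 1, "N": 0})

# 80/20 split, then train and evaluate a Logistic Regression model.
X = df.drop(columns="Loan_Status")
y = df["Loan_Status"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"accuracy on held-out 20%: {accuracy:.2f}")
```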
- Histogram — distribution of loan amounts showing most common borrowing ranges
- Count Plot — education level vs loan approval status comparison
- Histogram — distribution of applicant income showing income spread
- Model achieved approximately 79% accuracy on unseen test data
- Credit history was the strongest predictor of loan approval
- The model was very good at identifying approved applicants but misclassified some applicants whose loans were rejected
- Graduates had a higher loan approval rate compared to non-graduates
- Most applicants had incomes clustered in the lower range with a few high income outliers
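A model that handles one class well and the other less well is the kind of pattern a confusion matrix and per-class recall make visible. The sketch below uses made-up labels (1 = approved, 0 = not approved) purely for illustration; they are not the notebook's predictions.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Hypothetical ground truth and predictions for ten applicants.
y_true = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]

# Rows are true classes, columns are predicted classes, ordered [0, 1].
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])

accuracy = accuracy_score(y_true, y_pred)
recall_approved = recall_score(y_true, y_pred, pos_label=1)
recall_not_approved = recall_score(y_true, y_pred, pos_label=0)

print(cm)
print(f"accuracy: {accuracy:.2f}")
print(f"recall (approved): {recall_approved:.2f}")
print(f"recall (not approved): {recall_not_approved:.2f}")
```

Here overall accuracy looks reasonable even though half of the class-0 cases are predicted as class 1, which is why accuracy alone can hide this kind of weakness.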
Identify bank customers who are likely to close their account and leave the bank using machine learning classification.
The Churn Modelling Dataset contains information about 10,000 bank customers including their age, geography, gender, account balance, credit score, number of products, and whether they eventually left the bank (churned).
- Loaded and inspected the dataset
- Dropped irrelevant columns such as row number, customer ID, and surname
- Encoded categorical features — Gender using binary mapping and Geography using One-Hot Encoding
- Visualized churn distribution, age vs churn, and balance vs churn
- Trained a Random Forest Classifier on 80% of the data
- Analyzed feature importance to understand what drives churn
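The churn pipeline above can be sketched as follows. The data is synthetic and the column names (`RowNumber`, `CustomerId`, `Surname`, `Geography`, `Gender`, `Exited`, and so on) are assumptions about the CSV layout; the toy target is constructed so that older customers churn more often, so the resulting importances only illustrate the mechanics.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the churn CSV (hypothetical columns and values).
rng = np.random.default_rng(1)
n = 300
df = pd.DataFrame({
    "RowNumber": np.arange(1, n + 1),
    "CustomerId": rng.integers(10_000_000, 20_000_000, size=n),
    "Surname": rng.choice(["Smith", "Khan", "Muller"], size=n),
    "CreditScore": rng.integers(350, 850, size=n),
    "Geography": rng.choice(["France", "Germany", "Spain"], size=n),
    "Gender": rng.choice(["Male", "Female"], size=n),
    "Age": rng.integers(18, 80, size=n),
    "Balance": rng.uniform(0, 200_000, size=n).round(2),
    "NumOfProducts": rng.integers(1, 5, size=n),
})
# Toy target: churn probability grows with age.
df["Exited"] = (rng.random(n) < (df["Age"] - 18) / 80).astype(int)

# Drop identifier columns that carry no predictive signal.
df = df.drop(columns=["RowNumber", "CustomerId", "Surname"])

# Binary mapping for Gender, one-hot encoding for Geography.
df["Gender"] = df["Gender"].map({"Male": 1, "Female": 0})
df = pd.get_dummies(df, columns=["Geography"], dtype=int)

# 80/20 split and Random Forest training.
X = df.drop(columns="Exited")
y = df["Exited"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

# Rank features by impurity-based importance.
importances = pd.Series(model.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```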
- Count Plot — overall churn vs retention distribution
- Histogram — age distribution comparing churned vs retained customers
- Histogram — account balance distribution comparing churned vs retained customers
- Age was one of the strongest predictors of churn — older customers churned more
- Customers with higher account balances were more likely to leave
- Geography played a significant role — German customers had higher churn rates
- The Random Forest model outperformed simpler models due to its ensemble approach
Python-Data-Science-Analytics/
│
├── task1_iris_exploration.ipynb
├── task2_credit_risk_prediction.ipynb
├── task3_customer_churn_prediction.ipynb
└── README.md
- Clone this repository
    git clone https://github.com/yourusername/Python-Data-Science-Analytics.git
- Install required libraries
    pip install pandas matplotlib seaborn scikit-learn
- Open any notebook in VS Code or Jupyter and run all cells
More tasks coming soon