

## **Assignment 1: Customer and Transaction Analysis**

#### **Objective**
This assignment is designed to give you hands-on experience with data analysis tasks using Pandas in Python. You will perform data aggregation, data preprocessing, outlier detection, table joining, diagnostic analysis, and predictive analysis on customer, transaction, product, and supplier data.

#### **Dataset Overview**
You will work with four interconnected tables:
1. **Customers**: Basic information about customers.
2. **Products**: Details about each product.
3. **Suppliers**: Information about product suppliers.
4. **Transactions**: Records of each transaction made by customers.

#### **Assignment Tasks**

#### **Part 1: Data Aggregation**
1. Calculate the **total revenue** generated by each product category.
2. Find the **average income** of customers based on their location.
3. Determine the **total quantity of products** bought by each customer. Identify the top 10 customers by quantity purchased.
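
The three aggregations above could be sketched as follows. This is only an illustration on toy data: the column names (`Category`, `Quantity`, `Price`, `Location`, `Income`) are assumptions about the schema, and it assumes the category is already available on each transaction row. If it lives in the Products table, merge first.

```python
import pandas as pd

# Toy data; every column name here is an assumption about the real tables.
transactions = pd.DataFrame({
    "CustomerID": [1, 1, 2, 3],
    "Category": ["toys", "books", "toys", "books"],
    "Quantity": [2, 1, 3, 4],
    "Price": [10.0, 20.0, 10.0, 5.0],
})
customers = pd.DataFrame({
    "CustomerID": [1, 2, 3],
    "Location": ["Oslo", "Oslo", "Bergen"],
    "Income": [50000, 70000, 40000],
})

# 1. Revenue per transaction line, then summed per category.
transactions["Revenue"] = transactions["Quantity"] * transactions["Price"]
revenue_by_category = transactions.groupby("Category")["Revenue"].sum()

# 2. Mean income per location.
avg_income_by_location = customers.groupby("Location")["Income"].mean()

# 3. Total quantity per customer, keeping the 10 largest totals.
top_customers = transactions.groupby("CustomerID")["Quantity"].sum().nlargest(10)
```

`nlargest(10)` returns fewer rows when there are fewer than ten customers, as in this toy example.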

#### **Part 2: Data Preprocessing**
1. Handle any **missing values** in each table. Document your approach and reasoning.
2. **Standardize** product categories (e.g., make all lowercase or ensure consistent naming).
3. **Encode** categorical variables where necessary (e.g., Gender, Employment Status) for later analysis.
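
One possible way to approach these steps is shown below. The imputation choices (median for a numeric column, an explicit `"Unknown"` label for a categorical one) are assumptions made for illustration, not the required approach, and the toy columns are hypothetical.

```python
import pandas as pd

# Toy tables with deliberately missing and inconsistent values.
customers = pd.DataFrame({
    "CustomerID": [1, 2, 3, 4],
    "Gender": ["Male", "Female", None, "Female"],
    "Income": [50000.0, None, 42000.0, 61000.0],
})
products = pd.DataFrame({"Category": [" Toys", "toys ", "BOOKS"]})

# Missing values: median for numeric, explicit label for categorical (assumed policy).
customers["Income"] = customers["Income"].fillna(customers["Income"].median())
customers["Gender"] = customers["Gender"].fillna("Unknown")

# Standardize categories: trim whitespace and lowercase.
products["Category"] = products["Category"].str.strip().str.lower()

# One-hot encode the categorical column for later modeling.
encoded = pd.get_dummies(customers, columns=["Gender"])
```

Whichever policy you choose, record how many values were filled per column so the decision can be justified in your write-up.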

#### **Part 3: Outlier Detection**
1. **Identify and analyze outliers** in the Income column of the Customers table. Discuss potential reasons for these outliers.
2. Detect outliers in **Quantity** and **Price** columns of the Transactions table, using statistical or visual methods (e.g., box plots). What are your observations?
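
As one statistical option, the rule that box plots use for their whiskers (values beyond 1.5 times the interquartile range) can be applied directly. The helper name `iqr_outliers` and the sample incomes below are hypothetical; this is a sketch, and other methods such as z-scores are equally acceptable.

```python
import pandas as pd

def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Hypothetical incomes with one extreme value.
income = pd.Series([40000, 42000, 45000, 47000, 50000, 52000, 400000], name="Income")
mask = iqr_outliers(income)
outliers = income[mask]
```

The same function can be applied to the Quantity and Price columns. Flagged rows are candidates for inspection, not automatically errors.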

#### **Part 4: Joining Tables**
1. **Merge** the Transactions and Customers tables to create a consolidated view of each transaction, enriched with customer demographics.
2. Join the Products table to this consolidated dataset to add product details to each transaction.
3. Finally, **link the Supplier information** for each transaction using ProductID and SupplierID.
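
The chain of joins might look like the sketch below. It assumes that Transactions carries `CustomerID` and `ProductID`, and that Products carries `SupplierID`; check these key columns against the actual tables. The `validate` argument makes pandas raise an error if a key is unexpectedly duplicated on the lookup side.

```python
import pandas as pd

# Toy tables; key column placement is an assumption about the schema.
transactions = pd.DataFrame({
    "TransactionID": [1, 2, 3],
    "CustomerID": [10, 11, 10],
    "ProductID": [100, 101, 100],
})
customers = pd.DataFrame({"CustomerID": [10, 11], "Age": [30, 45]})
products = pd.DataFrame({
    "ProductID": [100, 101],
    "Category": ["toys", "books"],
    "SupplierID": [7, 8],
})
suppliers = pd.DataFrame({"SupplierID": [7, 8], "SupplierName": ["Acme", "Globex"]})

# Left joins keep every transaction even if a lookup row is missing.
full = (
    transactions
    .merge(customers, on="CustomerID", how="left", validate="many_to_one")
    .merge(products, on="ProductID", how="left", validate="many_to_one")
    .merge(suppliers, on="SupplierID", how="left", validate="many_to_one")
)
```

After each merge, comparing `len(full)` with `len(transactions)` is a quick check that no rows were duplicated or lost.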

#### **Part 5: Diagnostic Analysis**
1. Analyze the relationship between **customer income and total spending** across transactions. Do higher-income customers spend more?
2. Investigate the **impact of product categories** on total spending. Which product categories contribute most to revenue?
3. Perform a **time-series analysis** on the transaction data to identify peak sales periods within the year. Does spending vary by season or month?
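
A minimal sketch of the income and seasonality questions, assuming a merged table with hypothetical `Income`, `Amount`, and `Date` columns. Correlation is only a starting point here; a scatter plot and per-category breakdown would complement it.

```python
import pandas as pd

# Toy merged data; column names are assumptions.
merged = pd.DataFrame({
    "CustomerID": [1, 1, 2, 3, 3],
    "Income": [30000, 30000, 60000, 90000, 90000],
    "Amount": [10.0, 20.0, 50.0, 40.0, 60.0],
    "Date": ["2023-01-15", "2023-02-10", "2023-02-20", "2023-12-01", "2023-12-24"],
})

# Income vs. total spending: aggregate to one row per customer first.
per_customer = merged.groupby("CustomerID").agg(
    Income=("Income", "first"),
    TotalSpend=("Amount", "sum"),
)
income_spend_corr = per_customer["Income"].corr(per_customer["TotalSpend"])

# Seasonality: total revenue per calendar month.
merged["Date"] = pd.to_datetime(merged["Date"])
monthly_revenue = merged.groupby(merged["Date"].dt.month)["Amount"].sum()
peak_month = monthly_revenue.idxmax()
```

Aggregating per customer before correlating avoids counting frequent buyers several times.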

#### **Part 6: Predictive Analysis**
1. Build a **linear regression model** to predict the transaction amount based on customer demographics (e.g., Age, Income) and product category.
2. Assess the **accuracy of your model** and discuss any insights or limitations. What variables seem to influence transaction amounts most?
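
The modeling step could be sketched as below. The data is synthetic and generated only so the example runs on its own; the column names, the 75/25 split, and the choice of R² and mean absolute error as metrics are all assumptions rather than requirements.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the merged dataset (not real data).
rng = np.random.default_rng(0)
n = 120
df = pd.DataFrame({
    "Age": rng.integers(18, 70, n),
    "Income": rng.normal(50000, 15000, n).round(),
    "Category": rng.choice(["books", "toys", "electronics"], n),
})
df["Amount"] = (
    0.001 * df["Income"]
    + 30 * (df["Category"] == "electronics")
    + rng.normal(0, 5, n)
)

# One-hot encode the category; drop_first avoids a redundant column.
X = pd.get_dummies(
    df[["Age", "Income", "Category"]], columns=["Category"], drop_first=True, dtype=float
)
y = df["Amount"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

r2 = r2_score(y_test, pred)
mae = mean_absolute_error(y_test, pred)
coefficients = pd.Series(model.coef_, index=X.columns)
```

Raw coefficients are on different scales (Income in currency units, dummies as 0/1), so compare them with care when discussing which variables matter most.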



#### **Tips for Success**
- Use Pandas functions like `.groupby()`, `.merge()`, `.isnull()`, `.fillna()`, and `.describe()` to streamline your analysis.
- For predictive analysis, you may use `sklearn.linear_model.LinearRegression` or similar models available in scikit-learn.
- Document each step clearly, especially when making assumptions or decisions during preprocessing.




## **Assignment 2: Customer Churn Prediction**

#### **Objective**
This assignment will guide you in using Python and Pandas for data exploration and preparation, along with scikit-learn to build a machine learning model. Your task is to predict whether a customer is likely to make another purchase (Churn = 1) or not (Churn = 0). Note that this labeling is the reverse of the usual meaning of churn, so interpret the label as defined here.

#### **Dataset Overview**
The dataset contains 400 customer records with the following columns:
- **CustomerID**: Unique identifier for each customer.
- **Age**: Customer's age.
- **Gender**: Customer's gender (Male or Female).
- **Location**: Customer's city.
- **Income**: Customer’s annual income.
- **Education**: Highest level of education.
- **Employment Status**: Employment status (e.g., Employed, Unemployed, Self-employed, Retired).
- **Churn**: Target variable (1 for likely to return, 0 for unlikely to return).

#### **Assignment Tasks**

#### **Part 1: Exploratory Data Analysis (EDA)**
1. **Data Overview**: Display the first few rows of the dataset and get a summary of each column.
2. **Descriptive Statistics**: Analyze key statistics for numerical columns (e.g., Age, Income) and summarize categorical columns (Gender, Education, Employment Status).
3. **Visualization**:
   - Create a bar chart showing the distribution of the Churn label.
   - Visualize the distribution of Age and Income for customers with Churn = 1 vs. Churn = 0.
   - Analyze the influence of categorical variables (e.g., Education, Employment Status) on churn by creating relevant plots.
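
Before plotting, it can help to compute the tables that the plots will display. The sketch below uses a small hypothetical sample with the column names from the dataset overview; the values are made up for illustration.

```python
import pandas as pd

# Hypothetical sample rows; not taken from the real dataset.
df = pd.DataFrame({
    "Age": [25, 40, 33, 58, 47, 29],
    "Income": [30000, 65000, 42000, 80000, 72000, 35000],
    "Education": ["Bachelor", "Master", "Bachelor", "PhD", "Master", "Bachelor"],
    "Churn": [1, 0, 1, 0, 0, 1],
})

overview = df.head()
summary = df.describe(include="all")

# Class balance of the target.
churn_counts = df["Churn"].value_counts()

# Mean Age and Income per Churn label.
numeric_by_churn = df.groupby("Churn")[["Age", "Income"]].mean()

# Share of each Churn label within every education level.
churn_rate_by_education = pd.crosstab(df["Education"], df["Churn"], normalize="index")
```

Each of these results can be turned into a chart with the pandas `.plot(kind="bar")` method, which requires matplotlib to be installed.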

#### **Part 2: Data Preprocessing**
1. **Missing Values**: Check for missing values in each column and handle them appropriately.
2. **Encoding Categorical Variables**:
   - Convert categorical columns (Gender, Education, Employment Status) to numeric formats using one-hot encoding or label encoding.
3. **Feature Scaling**: Scale numerical features like Age and Income if needed, using standard scaling or normalization.
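
A short sketch of encoding and standard scaling on hypothetical rows follows. For brevity it fits the scaler on the whole table; in the actual workflow you would fit it on the training split only and reuse it on the test split, so that test data does not leak into preprocessing.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical rows using column names from the dataset overview.
df = pd.DataFrame({
    "Age": [20.0, 30.0, 40.0, 50.0],
    "Income": [20000.0, 40000.0, 60000.0, 80000.0],
    "Gender": ["Male", "Female", "Female", "Male"],
})

# One-hot encode the categorical column.
df = pd.get_dummies(df, columns=["Gender"], dtype=int)

# Standard scaling: subtract the mean, divide by the standard deviation.
scaler = StandardScaler()
df[["Age", "Income"]] = scaler.fit_transform(df[["Age", "Income"]])
```

Decision trees do not depend on feature scale, so this step matters mainly if you later compare against scale-sensitive models.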

#### **Part 3: Feature Engineering**
1. **Create New Features**:
   - Generate a new feature such as “Income per Age” (Income divided by Age) as a rough proxy for spending power relative to age.
2. **Feature Selection**: Decide which features are likely to contribute most to predicting churn. Provide reasoning for your selections.
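
The ratio feature could be computed as in the sketch below. The column name `IncomePerAge` and the sample values are hypothetical. The ratio should be built from the raw columns, before any scaling, because scaled ages can be zero or negative.

```python
import numpy as np
import pandas as pd

# Hypothetical rows, including an invalid Age of 0.
df = pd.DataFrame({"Age": [25, 40, 0], "Income": [50000, 80000, 30000]})

# Replace zero ages with NaN so the division yields NaN instead of infinity.
df["IncomePerAge"] = df["Income"] / df["Age"].replace(0, np.nan)
```

Rows where the ratio is NaN then need the same missing-value handling as any other column.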

#### **Part 4: Machine Learning Task - Decision Tree Model**
1. **Train-Test Split**: Split the dataset into training and testing sets (e.g., 80% train, 20% test).
2. **Decision Tree Classifier**:
   - Train a Decision Tree Classifier on the training data, using the selected features to predict churn.
   - Tune hyperparameters such as the maximum tree depth (`max_depth`) and the minimum number of samples required to split a node (`min_samples_split`) to improve performance.
3. **Model Evaluation**:
   - Calculate accuracy, precision, recall, and F1-score on the test set.
   - Plot the decision tree and identify the most important features for predicting churn.
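
The whole of Part 4 could be sketched end to end as below. The data is synthetic and exists only to make the example runnable; the hyperparameter values are arbitrary starting points, not recommendations. `export_text` is used here as a text alternative to plotting the tree.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in for the prepared features (not real data).
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "Age": rng.integers(18, 70, n),
    "Income": rng.normal(50000, 15000, n),
})
# Synthetic label loosely tied to Income, for illustration only.
y = (X["Income"] + rng.normal(0, 10000, n) > 50000).astype(int)

# 80/20 split, stratified so both sets keep the same class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

clf = DecisionTreeClassifier(max_depth=3, min_samples_split=10, random_state=0)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)

accuracy = accuracy_score(y_test, pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_test, pred, average="binary", zero_division=0
)

# Feature importances, largest first, and a text rendering of the tree.
importances = pd.Series(clf.feature_importances_, index=X.columns).sort_values(ascending=False)
tree_text = export_text(clf, feature_names=list(X.columns))
```

For a graphical version, `sklearn.tree.plot_tree` draws the fitted tree with matplotlib.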
   
#### **Part 5: Analysis and Conclusion**
1. **Model Performance**: Discuss the model’s performance metrics and interpret what they mean in the context of customer churn.
2. **Feature Importance**: List the top features the model used for prediction and discuss why these may be influential in predicting churn.
3. **Future Improvements**: Suggest at least two ways to improve the model’s performance, such as collecting more data, using different models, or engineering additional features.
