## **Income Prediction From US Census Adult Income Data**

Step into the realm of predictive analysis alongside Mia, the income predictor, in our project, "Income Prediction From US Census Adult Income Data." With Python as her language of choice and a repertoire of powerful algorithms at her disposal, Mia ventures into the world of data to unravel the complexities of income prediction.

An individual's annual income is shaped by numerous factors, such as education level, age, gender, and occupation. In this project, we delve into the task of income prediction using the US Adult Income Census dataset. Our primary goal is to accurately predict whether an individual's income falls above or below the $50,000 threshold.

The dataset has 15 columns. The target is the "income" field, which has two classes: <=50K and >50K. The other 14 attributes describe demographics and individual features, and we explore how well they predict income level.

Employing a suite of algorithms including Logistic Regression, KNN, LDA, Decision Trees, Gaussian NB, Random Forest, and SVC, Mia is equipped to unearth patterns in the data and make precise income predictions based on individual information.

Our project isn't merely about predicting incomes; it's a journey that unlocks the potential to understand and forecast an individual's financial standing based on a diverse range of factors.

By the end of this project, Mia doesn't just compute predictions; she uncovers the potential to enhance the understanding of income dynamics based on personal attributes.

Join Mia on this enlightening journey, where every algorithm and line of code illuminates the path to precise income predictions. Together, we'll unravel the complexities of income prediction and offer valuable insights to empower individuals and guide strategic decisions in various fields.

## Module 1
### Task 1: Loading Adult Income Data

In this task, we load the adult income data from the 'Income Prediction Adult Data.csv' file into a Pandas DataFrame named 'df'. This DataFrame is the foundation for all the analysis and insights we derive from the income dataset in later tasks.
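A minimal sketch of the loading step. So that it runs without the project file, it reads a three-row in-memory sample whose values are invented for illustration; in the notebook you would pass the path 'Income Prediction Adult Data.csv' to `pd.read_csv` instead.

```python
import io

import pandas as pd

# Hypothetical three-row sample that mimics the layout of the real file.
sample_csv = io.StringIO(
    "age,workclass,education.num,sex,hours.per.week,income\n"
    "39,Private,13,Male,40,<=50K\n"
    "52,?,9,Female,20,<=50K\n"
    "45,Self-emp-inc,14,Male,60,>50K\n"
)

# In the notebook: df = pd.read_csv("Income Prediction Adult Data.csv")
df = pd.read_csv(sample_csv)

# Inspect the first rows and the overall shape.
print(df.head())
print(df.shape)
```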

In [None]:
#--- Import Pandas ---

#--- Read in dataset(Income Prediction Adult Data.csv) ----
df = ...

#--- Inspect data ---

### Task 2: Identifying Null Values in Adult Income Data

In this task, we count the null values in each column of the 'df' dataset. Knowing whether values are missing, and where, tells us how complete the dataset is and whether imputation is needed before statistical analysis and model building for income prediction.
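One way to count nulls is `isnull().sum()`. The sketch below uses a small invented frame with one true NaN and one '?' placeholder, to show that the '?' markers handled later in Task 5 are ordinary strings and are not counted as null.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: one real NaN in 'age', one '?' placeholder in 'workclass'.
df = pd.DataFrame({
    "age": [39, np.nan, 45],
    "workclass": ["Private", "?", "Self-emp-inc"],
})

# Count null values per column.
sumofnull = df.isnull().sum()
print(sumofnull)

# '?' is a normal string, so isnull() does not see it; count it separately.
question_marks = df["workclass"].eq("?").sum()
print(question_marks)
```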

In [None]:
# --- WRITE YOUR CODE FOR MODULE 1 TASK 2 ---
sumofnull = ...

#--- Inspect data ---

### Task 3: Identifying Data Types in Adult Income Data

In this task, we determine the data types of the columns in the 'df' dataset. Understanding the data types is crucial for data analysis and preprocessing. This information aids in selecting appropriate statistical methods and feature engineering techniques that are relevant for the income prediction analysis based on the US Census adult income data.

In [None]:
# --- WRITE YOUR CODE FOR MODULE 1 TASK 3 ---
dtype = ...

#--- Inspect data ---

### Task 4: Categorizing Features in Adult Income Data

In this task, we categorize features within the 'df' dataset into numerical and categorical groups. Identifying numeric and categorical columns helps in defining how different features will be handled during data preprocessing and model building for income prediction based on the US Census adult income data. This categorization is essential for selecting appropriate statistical methods and preprocessing steps tailored for different types of features.
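A possible way to split the columns is `select_dtypes`. The frame below is an invented sample; note that 'income' still counts as categorical at this point because it has not yet been mapped to 0 and 1.

```python
import pandas as pd

# Hypothetical sample with two numeric and two string columns.
df = pd.DataFrame({
    "age": [39, 52, 45],
    "workclass": ["Private", "?", "Self-emp-inc"],
    "hours.per.week": [40, 20, 60],
    "income": ["<=50K", "<=50K", ">50K"],
})

# Numeric columns are selected by dtype; everything else is treated as categorical.
numeric_features = df.select_dtypes(include="number").columns.tolist()
cat_features = df.select_dtypes(exclude="number").columns.tolist()

print(numeric_features)
print(cat_features)
```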

In [None]:
# --- WRITE YOUR CODE FOR MODULE 1 TASK 4 ---
numeric_features = ...
cat_features = ...
#--- Inspect data ---

### Task 5: Handling Missing Values in Adult Income Data

In this task, a function named 'replace_missing_values' is utilized to replace missing values denoted as '?' in the 'workclass', 'occupation', and 'native.country' columns. By replacing these missing values with a defined 'Unknown' string or a specific country ('United-States'), this process helps in handling missing or unknown data within the dataset for a more accurate and comprehensive analysis for income prediction using the US Census adult income data.
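A sketch of what 'replace_missing_values' could look like. The task text does not say which column receives which replacement, so the split below ('Unknown' for 'workclass' and 'occupation', 'United-States' for 'native.country') is an assumption, and the sample rows are invented.

```python
import pandas as pd


def replace_missing_values(df):
    """Return a copy of df with '?' placeholders replaced."""
    out = df.copy()
    # Assumed split: 'Unknown' for the job columns, 'United-States' for the country.
    out["workclass"] = out["workclass"].replace("?", "Unknown")
    out["occupation"] = out["occupation"].replace("?", "Unknown")
    out["native.country"] = out["native.country"].replace("?", "United-States")
    return out


# Hypothetical sample containing one '?' in each affected column.
df = pd.DataFrame({
    "workclass": ["Private", "?", "State-gov"],
    "occupation": ["?", "Sales", "Tech-support"],
    "native.country": ["Mexico", "United-States", "?"],
})

df = replace_missing_values(df)
print(df)
```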

In [None]:
# --- WRITE YOUR CODE FOR MODULE 1 TASK 5 ---
#df = ...

#--- Inspect data ---

### Task 6: Mapping Income Categories in Adult Income Data

In this task, a dictionary 'income_map' is created to map income categories represented as '<=50K' and '>50K' to numerical values 0 and 1, respectively, within the 'income' column. This mapping is essential for creating a binary classification target variable, enabling predictive analysis for income levels in the US Census adult income data.
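The mapping can be sketched with `Series.map` on an invented three-row column. Any label missing from the dictionary becomes NaN, so checking for nulls afterwards is a cheap way to catch unexpected spellings.

```python
import pandas as pd

# Hypothetical sample of the target column.
df = pd.DataFrame({"income": ["<=50K", ">50K", "<=50K"]})

# Map the two income labels to a binary target.
income_map = {"<=50K": 0, ">50K": 1}
df["income"] = df["income"].map(income_map)

# Labels not present in income_map would show up as NaN here.
unmapped = df["income"].isnull().sum()
print(df["income"].tolist(), unmapped)
```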

In [None]:
# --- WRITE YOUR CODE FOR MODULE 1 TASK 6 ---
#df = ...

#--- Inspect data ---

## Module 2
### Task 1: Exploring Correlations in Adult Income Data

This task involves generating a heatmap using Seaborn to visualize the correlation matrix among the numerical features specified in the 'numeric_features' list within the 'df' dataset. This heatmap provides a visual representation of how strongly each numerical feature correlates with one another. Understanding these correlations is vital for feature selection and model building in the analysis of income prediction based on the US Census adult income data.
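A minimal sketch of the correlation heatmap on an invented six-row frame. The colour map, figure size, and fixed colour range are illustrative choices, not requirements of the task.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs outside a notebook; omit in Jupyter
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical numeric sample.
df = pd.DataFrame({
    "age": [25, 38, 52, 46, 31, 60],
    "education.num": [9, 13, 14, 10, 12, 16],
    "hours.per.week": [40, 45, 60, 38, 40, 20],
})
numeric_features = ["age", "education.num", "hours.per.week"]

# Pairwise Pearson correlations between the numeric columns.
corr_data = df[numeric_features].corr()

# Draw the matrix; vmin/vmax pin the colour scale to the full [-1, 1] range.
fig, ax = plt.subplots(figsize=(6, 5))
sns.heatmap(corr_data, annot=True, cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
ax.set_title("Correlation between numeric features")
```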

In [None]:
#--- Import matplotlib and  seaborn---

# --- WRITE YOUR CODE FOR MODULE 2 TASK 1 ---
corr_data = ...

### Task 2: Visualizing Income Distribution

This task creates a count plot with Seaborn to show how many records fall into each income level in the 'income' column of the 'df' dataset. The plot reveals the class balance between the two income categories (<=50K and >50K), which matters both for model training and for interpreting accuracy later on.
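A sketch of the count plot on an invented column. It assumes 'income' has already been mapped to 0 and 1 in Module 1, so the two bars are labelled 0 and 1 rather than '<=50K' and '>50K'.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs outside a notebook; omit in Jupyter
import pandas as pd
import seaborn as sns

# Hypothetical target column, already mapped to 0 (<=50K) and 1 (>50K).
df = pd.DataFrame({"income": [0, 0, 0, 1, 0, 1]})

# One bar per income class, bar height = number of rows.
income_ax = sns.countplot(x="income", data=df)

# The same information as numbers, useful for judging class balance.
class_counts = df["income"].value_counts()
print(class_counts)
```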

In [None]:
# --- WRITE YOUR CODE FOR MODULE 2 TASK 2 ---
income_ax = ...

### Task 3: Comparing Education and Income

This task uses Seaborn's catplot to create a bar chart comparing education (represented by 'education.num') with income ('income') in the 'df' dataset. The chart shows how income level varies with education level and highlights any trend across education levels.
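A sketch of the bar-type catplot on invented rows. It assumes 'income' is already 0 or 1, in which case each bar's height is the mean of 'income' for that education level, that is, the share of people in that group earning more than 50K.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs outside a notebook; omit in Jupyter
import pandas as pd
import seaborn as sns

# Hypothetical sample: three education levels with a 0/1 income target.
df = pd.DataFrame({
    "education.num": [9, 9, 9, 13, 13, 16, 16, 16],
    "income":        [0, 0, 1, 0, 1, 1, 1, 1],
})

# kind="bar" plots the mean of y for every distinct x value.
catplt_edu = sns.catplot(x="education.num", y="income", data=df, kind="bar")

# The bar heights can be cross-checked with a plain groupby.
rate_by_edu = df.groupby("education.num")["income"].mean()
print(rate_by_edu)
```

The same call pattern applies to any other column passed as `x`, numeric or categorical.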

In [None]:
# --- WRITE YOUR CODE FOR MODULE 2 TASK 3 ---
catplt_edu = ...

### Task 4: Analyzing Income and Weekly Working Hours

This task uses Seaborn's catplot to create a bar chart of the relationship between income ('income') and the number of hours worked per week ('hours.per.week') in the 'df' dataset. The chart shows how income level varies with weekly working hours and highlights any trend across working hours.

In [None]:
# --- WRITE YOUR CODE FOR MODULE 2 TASK 4 ---
catplt_hrs = ...


### Task 5: Comparing Age Distribution by Income Level

This task uses Seaborn and Matplotlib to create a FacetGrid with one column per income level ('<=50K' and '>50K'), each showing the distribution of ages in the 'df' dataset. Comparing the two age distributions shows how age relates to income level.

In [None]:
# --- WRITE YOUR CODE FOR MODULE 2 TASK 5 ---
hist_age = ...


### Task 6: Analyzing Income Based on Relationships

This task uses Seaborn's catplot to create a bar chart comparing income ('income') across the relationship types ('relationship') in the 'df' dataset. The chart shows how income level differs among relationship categories.

In [None]:
# --- WRITE YOUR CODE FOR MODULE 2 TASK 6 ---
catplt_relat = ...


### Task 7: Analyzing Income Based on Gender

This task utilizes Seaborn's catplot to create a bar chart comparing income levels ('income') based on gender ('sex') within the 'df' dataset. This visualization aims to understand how income levels differ between genders, providing insights into potential income disparities based on gender in the US Census adult income data analysis.

In [None]:
# --- WRITE YOUR CODE FOR MODULE 2 TASK 7 ---
catplt_sex = ...


### Task 8: Analyzing Income Based on Marital Status

This task uses Seaborn's catplot to create a bar chart comparing income ('income') across marital statuses ('marital.status') in the 'df' dataset. The chart shows how income level differs among marital statuses and highlights any disparities between them.

In [None]:
# --- WRITE YOUR CODE FOR MODULE 2 TASK 8 ---
catplt_married  = ...


### Task 9: Analyzing Income Based on Workclass

In this task, Seaborn's catplot is used to create a bar chart comparing income levels ('income') based on different work classes ('workclass') within the 'df' dataset. This visualization aims to understand how income levels differ across various work classes, providing insights into potential income disparities related to different employment categories in the US Census adult income data analysis.

In [None]:
# --- WRITE YOUR CODE FOR MODULE 2 TASK 9 ---
catplt_wrkcls  = ...

## Module 3
### Task 1: Data Preprocessing for Marital Status and Gender

This task creates a modified 'data' DataFrame. It encodes the 'sex' attribute from the strings 'Male' and 'Female' to the numerical values 0 and 1, respectively. It also consolidates the 'marital.status' attribute into two categories, 'Married' and 'Single', and replaces them with the binary values 1 and 0, respectively. This preprocessing simplifies the marital status and gender attributes so that the models can use them as numeric inputs.
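A sketch of both encodings on invented rows. The task text does not list which raw 'marital.status' labels count as 'Married', so the rule below (any label that starts with 'Married') is an assumption.

```python
import pandas as pd

# Hypothetical sample with raw string labels.
df = pd.DataFrame({
    "sex": ["Male", "Female", "Male", "Female"],
    "marital.status": [
        "Married-civ-spouse", "Never-married", "Divorced", "Married-AF-spouse",
    ],
})

# Work on a copy so the original df stays available for plotting.
data = df.copy()

# Encode sex: Male -> 0, Female -> 1.
data["sex"] = data["sex"].map({"Male": 0, "Female": 1})

# Assumed grouping: labels starting with 'Married' -> 1 (Married), all others -> 0 (Single).
is_married = data["marital.status"].str.startswith("Married")
data["marital.status"] = is_married.astype(int)

print(data)
```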

In [None]:
# --- WRITE YOUR CODE FOR MODULE 3 TASK 1 ---
data  = ...

### Task 2: Removing Certain Attributes for Data Simplification

This task involves dropping specific columns, including "workclass", "education", "occupation", "relationship", "race", and "native.country" from the 'data' DataFrame. The removal of these columns simplifies the dataset by eliminating non-essential or redundant attributes, focusing the analysis on the more relevant features for the US Census adult income data analysis.
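The drop can be sketched with `DataFrame.drop(columns=...)` on an invented two-row frame. Of the removed columns, 'education' is the clearest redundancy, since 'education.num' carries the same information in numeric form.

```python
import pandas as pd

# Hypothetical two-row frame containing the columns to remove plus a few to keep.
data = pd.DataFrame({
    "age": [39, 52],
    "workclass": ["Private", "Unknown"],
    "education": ["Bachelors", "HS-grad"],
    "education.num": [13, 9],
    "occupation": ["Sales", "Unknown"],
    "relationship": ["Husband", "Not-in-family"],
    "race": ["White", "Black"],
    "sex": [0, 1],
    "native.country": ["United-States", "United-States"],
    "income": [1, 0],
})

cols_to_drop = [
    "workclass", "education", "occupation",
    "relationship", "race", "native.country",
]
data = data.drop(columns=cols_to_drop)

print(data.columns.tolist())
```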

In [None]:
# --- WRITE YOUR CODE FOR MODULE 3 TASK 2 ---
#data  = ...

## Module 4
### Task 1: Data Splitting for Modeling in Adult Income Data

This task involves splitting the 'data' DataFrame into training and validation sets for modeling. The feature set 'x' is generated by dropping the 'income' column, while the target variable 'y' consists solely of the 'income' column. Utilizing the train_test_split function from sklearn.model_selection, the data is split into 80% training data ('X_train' and 'Y_train') and 20% validation data ('X_validation' and 'Y_validation'). This step is essential for training and evaluating models in the US Census adult income data analysis.
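A sketch of the 80/20 split on an invented 20-row frame. The `random_state` value is an assumption added here so the split is reproducible; the task text does not specify one.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical 20-row frame standing in for the preprocessed 'data'.
data = pd.DataFrame({
    "age": list(range(20, 40)),
    "hours.per.week": [40] * 10 + [50] * 10,
    "income": [0, 1] * 10,
})

# Features are everything except the target column.
x = data.drop(columns="income")
y = data["income"]

# test_size=0.2 keeps 80% for training and 20% for validation.
X_train, X_validation, Y_train, Y_validation = train_test_split(
    x, y, test_size=0.2, random_state=42  # random_state is an assumed value
)

print(X_train.shape, X_validation.shape)
```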

In [None]:
#--- Import train_test_split ---
# --- WRITE YOUR CODE FOR MODULE 4 TASK 1 ---
#X_train, X_validation, Y_train, Y_validation  = ...

#--- Inspect data ---

### Task 2: Training and Evaluating Logistic Regression Model

This task involves training a Logistic Regression model using the training data 'X_train' and 'Y_train' and then assessing the model's performance using cross-validation with 10 folds. The cross_val_score function from sklearn.model_selection assesses the Logistic Regression model's performance and computes the mean accuracy score over the 10 folds. This evaluation is crucial for understanding the model's predictive capability in the US Census adult income data analysis.
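A sketch of 10-fold cross-validation for Logistic Regression. The training data is synthetic (generated with `make_classification`) so the block runs on its own, and `max_iter=1000` is an assumed setting to give the solver room to converge. Note that `cross_val_score` fits a fresh clone of the model on each fold, so no separate `fit` call is needed to obtain the scores.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for X_train / Y_train.
X_train, Y_train = make_classification(
    n_samples=200, n_features=6, n_informative=4, n_redundant=0, random_state=0
)

lr = LogisticRegression(max_iter=1000)  # max_iter is an assumed setting

# One accuracy value per fold, then their mean.
lr_scores = cross_val_score(lr, X_train, Y_train, cv=10, scoring="accuracy")
lr_mean_score = lr_scores.mean()

print(lr_mean_score)
```

Swapping `LogisticRegression` for another scikit-learn classifier leaves the rest of the pattern unchanged.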

In [None]:
#--- Import LogisticRegression ---
#--- Import cross_val_score from sklearn.model_selection ---

# --- WRITE YOUR CODE FOR MODULE 4 TASK 2 ---
#lr_mean_score  = ...

#--- Inspect data ---

### Task 3: Training and Evaluating Linear Discriminant Analysis Model

This task involves training a Linear Discriminant Analysis (LDA) model using the training data 'X_train' and 'Y_train' and then assessing the model's performance via 10-fold cross-validation. The cross_val_score function from sklearn.model_selection is used to evaluate the LDA model's accuracy across the 10 folds. This evaluation is vital for understanding the model's performance in predicting income classes within the US Census adult income data analysis.

In [None]:
#--- Import LinearDiscriminantAnalysis ---

# --- WRITE YOUR CODE FOR MODULE 4 TASK 3 ---
#ldr_mean_score  = ...

#--- Inspect data ---

### Task 4: Training and Evaluating K-Nearest Neighbors Classifier

This task involves training a K-Nearest Neighbors (KNN) classification model using the training data 'X_train' and 'Y_train'. Subsequently, the model's performance is assessed by utilizing the cross_val_score function from sklearn.model_selection, evaluating the KNN model's accuracy across 10-fold cross-validation. This evaluation is important to gauge the KNN model's predictive performance in determining income classes within the US Census adult income data analysis.

In [None]:
#--- Import KNeighborsClassifier ---

# --- WRITE YOUR CODE FOR MODULE 4 TASK 4 ---
#knn_mean_score  = ...

#--- Inspect data ---

### Task 5: Training and Evaluating Decision Tree Classifier

This task involves training a Decision Tree Classifier using the training data 'X_train' and 'Y_train', followed by assessing the model's performance through 10-fold cross-validation. The cross_val_score function from sklearn.model_selection is used to compute the Decision Tree Classifier's mean accuracy across the 10 folds. This evaluation is crucial for understanding the Decision Tree model's performance in predicting income classes within the US Census adult income data analysis.

In [None]:
#--- Import DecisionTreeClassifier ---

# --- WRITE YOUR CODE FOR MODULE 4 TASK 5 ---
#dt_mean_score  = ...

#--- Inspect data ---

### Task 6: Training and Evaluating Gaussian Naive Bayes Classifier

This task involves training a Gaussian Naive Bayes (NB) classifier using the training data 'X_train' and 'Y_train', followed by assessing the model's performance via 10-fold cross-validation. The cross_val_score function from sklearn.model_selection is utilized to determine the Gaussian NB model's mean accuracy across the 10 folds. This evaluation is essential for comprehending the model's predictive performance in determining income classes within the US Census adult income data analysis.

In [None]:
#--- Import GaussianNB ---

# --- WRITE YOUR CODE FOR MODULE 4 TASK 6 ---
#gb_mean_score   = ...

#--- Inspect data ---

### Task 7: Model Optimization using GridSearchCV

In this task, GridSearchCV from sklearn.model_selection is used to tune the Linear Discriminant Analysis (LDA) model. The grid covers the 'solver' parameter with the values 'svd', 'lsqr', and 'eigen'. GridSearchCV evaluates each value with 5-fold cross-validation on the training data ('X_train' and 'Y_train'). After the search, 'best_params' holds the best parameter combination, 'best_score' holds its mean cross-validated score, and 'best_ldr_model' holds the LDA model refit with those parameters.
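A sketch of the solver search on synthetic data, generated without redundant features so that all three solvers can be fitted. Accuracy as the scoring metric is an assumption; the variable names follow the task description.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for X_train / Y_train.
X_train, Y_train = make_classification(
    n_samples=200, n_features=6, n_informative=4, n_redundant=0, random_state=0
)

# Candidate values for the LDA solver.
param_grid = {"solver": ["svd", "lsqr", "eigen"]}

grid_search = GridSearchCV(
    LinearDiscriminantAnalysis(), param_grid, cv=5, scoring="accuracy"
)
grid_search.fit(X_train, Y_train)

best_params = grid_search.best_params_        # best solver found
best_score = grid_search.best_score_          # its mean cross-validated accuracy
best_ldr_model = grid_search.best_estimator_  # model refit on all training data

print(best_params, best_score)
```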

In [None]:
#--- Import GridSearchCV ---

# --- WRITE YOUR CODE FOR MODULE 4 TASK 7 ---
#best_score   = ...

#--- Inspect data ---

### Task 8: Evaluating Model Performance on Validation Data

This task involves evaluating the performance of the selected model, identified from previous training assessments, using the validation dataset 'X_validation' and 'Y_validation'. The accuracy_score function from sklearn.metrics determines the accuracy by comparing the predicted values ('y_pred') to the actual values ('Y_validation'). Additionally, the confusion_matrix and classification_report functions from sklearn.metrics are used to generate a confusion matrix and a comprehensive classification report, respectively. These metrics provide insights into the model's predictive accuracy and misclassifications, aiding in assessing the model's performance in predicting income classes within the US Census adult income data analysis.
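A sketch of the validation step on synthetic data. A default LDA model stands in for "the selected model", which is an assumption; substitute whichever model performed best in your own runs.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed data.
X, y = make_classification(
    n_samples=200, n_features=6, n_informative=4, n_redundant=0, random_state=0
)
X_train, X_validation, Y_train, Y_validation = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Stand-in for the selected model.
model = LinearDiscriminantAnalysis().fit(X_train, Y_train)
y_pred = model.predict(X_validation)

accuracy = accuracy_score(Y_validation, y_pred)
cm = confusion_matrix(Y_validation, y_pred)        # rows: actual, columns: predicted
cr = classification_report(Y_validation, y_pred)   # per-class precision, recall, F1

print(accuracy)
print(cm)
print(cr)
```

Accuracy equals the sum of the confusion matrix diagonal divided by the total number of validation rows, which is a quick consistency check between the two outputs.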

In [None]:
#--- Import accuracy_score, confusion_matrix, classification_report from sklearn.metrics ---


# --- WRITE YOUR CODE FOR MODULE 4 TASK 8 ---
accuracy= ...

cm= ...

cr= ...

#--- Inspect data ---

### Task 9: Evaluating Precision, Recall, and F1-Score

This task calculates the precision, recall, and F1-score metrics for the model using the validation dataset. The precision_score, recall_score, and f1_score functions from sklearn.metrics are employed to compute precision, recall, and the F1-score, respectively. These metrics offer insights into the model's precision (accuracy of positive predictions), recall (sensitivity to actual positives), and the balance between precision and recall captured by the F1-score. These calculations further aid in understanding the model's performance in predicting income classes within the US Census adult income data analysis.
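A small worked example with hand-written labels, so the three metrics can be verified by counting. The label vectors are invented; in the notebook, 'Y_validation' and 'y_pred' come from the previous task.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical labels: 4 actual positives, 6 actual negatives.
Y_validation = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred       = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]
# Counting by hand: TP = 3, FN = 1, FP = 2, TN = 4.

# precision = TP / (TP + FP) = 3 / 5
precision = precision_score(Y_validation, y_pred)

# recall = TP / (TP + FN) = 3 / 4
recall = recall_score(Y_validation, y_pred)

# F1 = 2 * precision * recall / (precision + recall) = 2 / 3
F1_score = f1_score(Y_validation, y_pred)

print(precision, recall, F1_score)
```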

In [None]:
#--- Import precision_score, recall_score, f1_score from sklearn.metrics ---


# --- WRITE YOUR CODE FOR MODULE 4 TASK 9 ---
# Calculate precision
precision = ...

# Calculate recall
recall = ...

# Calculate F1-score
F1_score = ...


#precision,recall,F1_score

#--- Inspect data ---

### Task 10: Predicting Income Class on New Data

This task uses the best-performing model from the prior evaluations to predict the income class for new data. The new data is a single-row pandas DataFrame holding one individual's features, such as age, fnlwgt, and years of education, under the same column names used for training. Replace the model placeholder with the specific best model selected previously. Its 'predict' method returns the predicted income class (0 for <=50K, 1 for >50K) for this individual.
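A sketch of the prediction call. Because the real training data is not available here, the model is fitted on randomly generated rows with an invented target rule, so the predicted class itself is meaningless; the point is only the call pattern and the matching column names. The LDA model is an assumed stand-in for your best model.

```python
import numpy as np
import pandas as pd
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

feature_cols = [
    "age", "fnlwgt", "education.num", "marital.status",
    "sex", "capital.gain", "capital.loss", "hours.per.week",
]

# Randomly generated training rows, for illustration only.
rng = np.random.default_rng(0)
n = 200
train = pd.DataFrame({
    "age": rng.integers(18, 80, n),
    "fnlwgt": rng.integers(20_000, 400_000, n),
    "education.num": rng.integers(1, 17, n),
    "marital.status": rng.integers(0, 2, n),
    "sex": rng.integers(0, 2, n),
    "capital.gain": rng.integers(0, 5_000, n),
    "capital.loss": rng.integers(0, 2_000, n),
    "hours.per.week": rng.integers(10, 70, n),
})[feature_cols]

# Invented target rule, not derived from the census data.
target = (train["education.num"] >= 13).astype(int)

# Stand-in for the best model chosen in the earlier tasks.
model = LinearDiscriminantAnalysis().fit(train, target)

# The sample individual from the task, with columns in the training order.
new_data = pd.DataFrame([[74, 88638, 16, 0, 1, 0, 683, 20]], columns=feature_cols)
prediction = model.predict(new_data)

print(prediction)
```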

In [None]:
# --- WRITE YOUR CODE FOR MODULE 4 TASK 10 ---
#You need to predict the result by passing the sample data available here to your model to make a prediction.

new_data = pd.DataFrame([[74, 88638, 16, 0, 1, 0, 683, 20]],
                    columns=['age', 'fnlwgt', 'education.num', 'marital.status', 'sex', 'capital.gain', 'capital.loss', 'hours.per.week'])
prediction = ...

#--- Inspect data ---

In [None]:
## **Income Prediction From US Census Adult Income Data**

Step into the realm of predictive analysis alongside Mia, the income predictor, in our project, "Income Prediction From US Census Adult Income Data." With Python as her language of choice and a repertoire of powerful algorithms at her disposal, Mia ventures into the world of data to unravel the complexities of income prediction.

An individual's annual income is shaped by numerous factors, such as education level, age, gender, and occupation. In this project, we delve into the task of income prediction using the US Adult Income Census dataset. Our primary goal is to accurately predict whether an individual's income falls above or below the $50,000 threshold.

The dataset at hand encompasses 16 columns, with the focal point being the "Income" field, divided into two classes: <=50K and >50K. With 14 other attributes detailing demographics and individual features, we aim to explore the potential for predicting income levels based on personal information.

Employing a suite of algorithms including Logistic Regression, KNN, LDA, Decision Trees, Gaussian NB, Random Forest, and SVC, Mia is equipped to unearth patterns in the data and make precise income predictions based on individual information.

Our project isn't merely about predicting incomes; it's a journey that unlocks the potential to understand and forecast an individual's financial standing based on a diverse range of factors.

By the end of this project, Mia doesn't just compute predictions; she uncovers the potential to enhance the understanding of income dynamics based on personal attributes.

Join Mia on this enlightening journey, where every algorithm and line of code illuminates the path to precise income predictions. Together, we'll unravel the complexities of income prediction and offer valuable insights to empower individuals and guide strategic decisions in various fields.

## Module 1
### Task 1: Loading Adult Income Data

In this task, we load the adult income data from the 'Income Prediction Adult Data.csv' file into a Pandas DataFrame named 'df.' This step is crucial for our new project, "Income Prediction From US Census Adult Income Data," as it forms the foundation for the data analysis and insights that we aim to derive from the income dataset.

#--- Import Pandas ---

#--- Read in dataset(Income Prediction Adult Data.csv) ----
df = ...

#--- Inspect data ---

### Task 2: Identifying Null Values in Adult Income Data

In this task, we identify and count the null values within the 'df' dataset. Recognizing null values is crucial to understand the completeness of the dataset. This process helps in determining whether any missing values exist within the data. Addressing or imputing null values is essential for ensuring accurate statistical analyses and model building for income prediction based on the US Census adult income data.

# --- WRITE YOUR CODE FOR MODULE 1 TASK 2 ---
sumofnull = ...

#--- Inspect data ---

### Task 3: Identifying Data Types in Adult Income Data

In this task, we determine the data types of the columns in the 'df' dataset. Understanding the data types is crucial for data analysis and preprocessing. This information aids in selecting appropriate statistical methods and feature engineering techniques that are relevant for the income prediction analysis based on the US Census adult income data.

# --- WRITE YOUR CODE FOR MODULE 1 TASK 3 ---
dtype = ...

#--- Inspect data ---

### Task 4: Categorizing Features in Adult Income Data

In this task, we categorize features within the 'df' dataset into numerical and categorical groups. Identifying numeric and categorical columns helps in defining how different features will be handled during data preprocessing and model building for income prediction based on the US Census adult income data. This categorization is essential for selecting appropriate statistical methods and preprocessing steps tailored for different types of features.

# --- WRITE YOUR CODE FOR MODULE 1 TASK 4 ---
numeric_features = ...
cat_features = ...
#--- Inspect data ---

### Task 5: Handling Missing Values in Adult Income Data

In this task, a function named 'replace_missing_values' is utilized to replace missing values denoted as '?' in the 'workclass', 'occupation', and 'native.country' columns. By replacing these missing values with a defined 'Unknown' string or a specific country ('United-States'), this process helps in handling missing or unknown data within the dataset for a more accurate and comprehensive analysis for income prediction using the US Census adult income data.

# --- WRITE YOUR CODE FOR MODULE 1 TASK 5 ---
#df = ...

#--- Inspect data ---

### Task 6: Mapping Income Categories in Adult Income Data

In this task, a dictionary 'income_map' is created to map income categories represented as '<=50K' and '>50K' to numerical values 0 and 1, respectively, within the 'income' column. This mapping is essential for creating a binary classification target variable, enabling predictive analysis for income levels in the US Census adult income data.

# --- WRITE YOUR CODE FOR MODULE 1 TASK 6 ---
#df = ...

#--- Inspect data ---

## Module 2
### Task 1: Exploring Correlations in Adult Income Data

This task involves generating a heatmap using Seaborn to visualize the correlation matrix among the numerical features specified in the 'numeric_features' list within the 'df' dataset. This heatmap provides a visual representation of how strongly each numerical feature correlates with one another. Understanding these correlations is vital for feature selection and model building in the analysis of income prediction based on the US Census adult income data.

#--- Import matplotlib and  seaborn---

# --- WRITE YOUR CODE FOR MODULE 2 TASK 1 ---
corr_data = ...

### Task 2: Visualizing Income Distribution

This task involves creating a count plot using Seaborn to illustrate the distribution of income levels within the 'income' column of the 'df' dataset. This plot helps visualize the distribution of income categories (<=50K and >50K) in the US Census adult income data. Understanding this distribution is crucial for understanding the dataset's class balance, an essential factor for accurate model training and prediction in income analysis.

# --- WRITE YOUR CODE FOR MODULE 2 TASK 2 ---
income_ax = ...

### Task 3: Comparing Education and Income
This task involves using Seaborn's catplot to create a bar chart to compare the relationship between education (represented by 'education.num') and income ('income') within the 'df' dataset. The visualization helps understand how income levels vary concerning different education levels, providing insights into potential trends or differences based on education in the US Census adult income data analysis.

# --- WRITE YOUR CODE FOR MODULE 2 TASK 3 ---
catplt_edu = ...

### Task 4: Analyzing Income and Weekly Working Hours

This task employs Seaborn's catplot to create a bar chart for exploring the relationship between income ('income') and the number of hours worked per week ('hours.per.week') within the 'df' dataset. This visualization helps to understand how income levels vary concerning the hours worked per week, providing insights into potential trends or differences based on working hours in the US Census adult income data analysis.

# --- WRITE YOUR CODE FOR MODULE 2 TASK 4 ---
catplt_hrs = ...


### Task 5: Comparing Age Distribution by Income Level

This task involves utilizing Seaborn and Matplotlib to create a FacetGrid with separate columns for each income level ('<=50K' and '>50K') to depict the distribution of ages within the 'df' dataset. By visualizing age distribution concerning income levels, this plot provides insights into how age factors into income disparities, aiding in the analysis of the US Census adult income data.

# --- WRITE YOUR CODE FOR MODULE 2 TASK 5 ---
hist_age = ...


### Task 6: Analyzing Income Based on Relationships

This task employs Seaborn's catplot to create a bar chart comparing income levels ('income') with different types of relationships ('relationship') within the 'df' dataset. This visualization aims to understand how income varies concerning various relationships, offering insights into potential income differences among different relationship statuses in the US Census adult income data analysis.

# --- WRITE YOUR CODE FOR MODULE 2 TASK 6 ---
catplt_relat = ...


### Task 7: Analyzing Income Based on Gender

This task utilizes Seaborn's catplot to create a bar chart comparing income levels ('income') based on gender ('sex') within the 'df' dataset. This visualization aims to understand how income levels differ between genders, providing insights into potential income disparities based on gender in the US Census adult income data analysis.

# --- WRITE YOUR CODE FOR MODULE 2 TASK 7 ---
catplt_sex = ...


### Task 8: Analyzing Income Based on Marital Status
This task employs Seaborn's catplot to create a bar chart comparing income levels ('income') based on marital status ('marital.status') within the 'df' dataset. This visualization aims to understand how income levels differ across different marital statuses, providing insights into potential income disparities related to marital status in the US Census adult income data analysis.

# --- WRITE YOUR CODE FOR MODULE 2 TASK 8 ---
catplt_married  = ...


### Task 9: Analyzing Income Based on Workclass

In this task, Seaborn's catplot is used to create a bar chart comparing income levels ('income') based on different work classes ('workclass') within the 'df' dataset. This visualization aims to understand how income levels differ across various work classes, providing insights into potential income disparities related to different employment categories in the US Census adult income data analysis.

# --- WRITE YOUR CODE FOR MODULE 2 TASK 9 ---
catplt_wrkcls  = ...

## Module 3
### Task 1: Data Preprocessing for Relationship and Gender

This task involves creating a modified 'data' DataFrame by encoding the 'sex' attribute from strings 'Male' and 'Female' to numerical values 0 and 1, respectively. Furthermore, it consolidates the 'marital.status' attribute into two categories 'Married' and 'Single' for improved analysis and replaces these categories with binary values 1 and 0, respectively. This preprocessing is vital for refining and simplifying the relationship and gender attributes to enhance the accuracy and interpretability of the data for further analysis in the US Census adult income data.

# --- WRITE YOUR CODE FOR MODULE 3 TASK 1 ---
data  = ...

### Task 2: Removing Certain Attributes for Data Simplification

This task involves dropping specific columns, including "workclass", "education", "occupation", "relationship", "race", and "native.country" from the 'data' DataFrame. The removal of these columns simplifies the dataset by eliminating non-essential or redundant attributes, focusing the analysis on the more relevant features for the US Census adult income data analysis.

# --- WRITE YOUR CODE FOR MODULE 3 TASK 2 ---
#data  = ...

## Module 4
### Task 1: Data Splitting for Modeling in Adult Income Data

This task involves splitting the 'data' DataFrame into training and validation sets for modeling. The feature set 'x' is generated by dropping the 'income' column, while the target variable 'y' consists solely of the 'income' column. Utilizing the train_test_split function from sklearn.model_selection, the data is split into 80% training data ('X_train' and 'Y_train') and 20% validation data ('X_validation' and 'Y_validation'). This step is essential for training and evaluating models in the US Census adult income data analysis.

#--- Import train_test_split ---
# --- WRITE YOUR CODE FOR MODULE 4 TASK 1 ---
#X_train, X_validation, Y_train, Y_validation  = ...

#--- Inspect data ---

### Task 2: Training and Evaluating Logistic Regression Model

This task involves training a Logistic Regression model using the training data 'X_train' and 'Y_train' and then assessing the model's performance using cross-validation with 10 folds. The cross_val_score function from sklearn.model_selection assesses the Logistic Regression model's performance and computes the mean accuracy score over the 10 folds. This evaluation is crucial for understanding the model's predictive capability in the US Census adult income data analysis.

#--- Import LogisticRegression ---
#--- Import cross_val_score from sklearn.model_selection ---

# --- WRITE YOUR CODE FOR MODULE 4 TASK 2 ---
#lr_mean_score  = ...

#--- Inspect data ---
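
A minimal sketch of the 10-fold evaluation is shown below. It uses `make_classification` to generate stand-in training data, and `max_iter=1000` is an assumption added to avoid convergence warnings; neither comes from the task description.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for X_train / Y_train (assumption).
X_train, Y_train = make_classification(
    n_samples=200, n_features=8, n_informative=4, n_redundant=0, random_state=0
)

lr = LogisticRegression(max_iter=1000)

# cross_val_score fits a fresh clone of the model on each of the 10 folds.
lr_scores = cross_val_score(lr, X_train, Y_train, cv=10, scoring="accuracy")
lr_mean_score = lr_scores.mean()

print(f"Mean accuracy over 10 folds: {lr_mean_score:.3f}")
```

Because `cross_val_score` clones the estimator internally, 'lr' itself remains unfitted afterwards; call `lr.fit(X_train, Y_train)` separately if a fitted model is needed.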

### Task 3: Training and Evaluating Linear Discriminant Analysis Model

This task involves training a Linear Discriminant Analysis (LDA) model using the training data 'X_train' and 'Y_train' and then assessing the model's performance via 10-fold cross-validation. The cross_val_score function from sklearn.model_selection is used to evaluate the LDA model's accuracy across the 10 folds. This evaluation is vital for understanding the model's performance in predicting income classes within the US Census adult income data analysis.

#--- Import LinearDiscriminantAnalysis ---

# --- WRITE YOUR CODE FOR MODULE 4 TASK 3 ---
#ldr_mean_score  = ...

#--- Inspect data ---

### Task 4: Training and Evaluating K-Nearest Neighbors Classifier

This task involves training a K-Nearest Neighbors (KNN) classification model using the training data 'X_train' and 'Y_train'. Subsequently, the model's performance is assessed by utilizing the cross_val_score function from sklearn.model_selection, evaluating the KNN model's accuracy across 10-fold cross-validation. This evaluation is important to gauge the KNN model's predictive performance in determining income classes within the US Census adult income data analysis.

#--- Import KNeighborsClassifier ---

# --- WRITE YOUR CODE FOR MODULE 4 TASK 4 ---
#knn_mean_score  = ...

#--- Inspect data ---

### Task 5: Training and Evaluating Decision Tree Classifier

This task involves training a Decision Tree Classifier using the training data 'X_train' and 'Y_train', followed by assessing the model's performance through 10-fold cross-validation. The cross_val_score function from sklearn.model_selection is used to compute the Decision Tree Classifier's mean accuracy across the 10 folds. This evaluation is crucial for understanding the Decision Tree model's performance in predicting income classes within the US Census adult income data analysis.

#--- Import DecisionTreeClassifier ---

# --- WRITE YOUR CODE FOR MODULE 4 TASK 5 ---
#dt_mean_score  = ...

#--- Inspect data ---

### Task 6: Training and Evaluating Gaussian Naive Bayes Classifier

This task involves training a Gaussian Naive Bayes (NB) classifier using the training data 'X_train' and 'Y_train', followed by assessing the model's performance via 10-fold cross-validation. The cross_val_score function from sklearn.model_selection is utilized to determine the Gaussian NB model's mean accuracy across the 10 folds. This evaluation is essential for comprehending the model's predictive performance in determining income classes within the US Census adult income data analysis.

#--- Import GaussianNB ---

# --- WRITE YOUR CODE FOR MODULE 4 TASK 6 ---
#gb_mean_score   = ...

#--- Inspect data ---
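
Tasks 3 to 6 all follow the same pattern: build a classifier, then average its accuracy over 10 folds. One way to sketch this is a loop over the four models. The synthetic data and the `random_state=0` on the decision tree are assumptions for reproducibility, and the dictionary keys simply mirror the variable names used in the task cells.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X_train / Y_train (assumption).
X_train, Y_train = make_classification(
    n_samples=200, n_features=8, n_informative=4, n_redundant=0, random_state=0
)

models = {
    "ldr": LinearDiscriminantAnalysis(),
    "knn": KNeighborsClassifier(),
    "dt": DecisionTreeClassifier(random_state=0),
    "gb": GaussianNB(),
}

# Mean 10-fold accuracy for each model.
mean_scores = {
    name: cross_val_score(model, X_train, Y_train, cv=10, scoring="accuracy").mean()
    for name, model in models.items()
}

for name, score in mean_scores.items():
    print(f"{name}_mean_score = {score:.3f}")
```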

### Task 7: Model Optimization using GridSearchCV

In this task, GridSearchCV from sklearn.model_selection is used to tune the Linear Discriminant Analysis (LDA) model. The candidate values for the 'solver' parameter are 'svd', 'lsqr', and 'eigen'. GridSearchCV evaluates each candidate with 5-fold cross-validation on the training data ('X_train' and 'Y_train'). After the grid search, 'best_params' holds the best-scoring parameter combination, 'best_score' holds the mean cross-validated score achieved by that combination, and 'best_ldr_model' holds the LDA model refitted with those parameters for use in the remaining steps of the US Census adult income data analysis.

#--- Import GridSearchCV ---

# --- WRITE YOUR CODE FOR MODULE 4 TASK 7 ---
#best_score   = ...

#--- Inspect data ---
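
The grid search described above might look like the sketch below. The training data is synthetic, and `scoring="accuracy"` is an assumption (it matches the default scorer for classifiers, but the task does not state it).

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for X_train / Y_train (assumption).
X_train, Y_train = make_classification(
    n_samples=200, n_features=8, n_informative=4, n_redundant=0, random_state=0
)

param_grid = {"solver": ["svd", "lsqr", "eigen"]}

grid_search = GridSearchCV(
    LinearDiscriminantAnalysis(), param_grid, cv=5, scoring="accuracy"
)
grid_search.fit(X_train, Y_train)

best_params = grid_search.best_params_         # e.g. {'solver': ...}
best_score = grid_search.best_score_           # mean CV accuracy of the best candidate
best_ldr_model = grid_search.best_estimator_   # LDA refitted on all of X_train

print(best_params, round(best_score, 3))
```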

### Task 8: Evaluating Model Performance on Validation Data

This task involves evaluating the performance of the selected model, identified from previous training assessments, using the validation dataset 'X_validation' and 'Y_validation'. The accuracy_score function from sklearn.metrics determines the accuracy by comparing the predicted values ('y_pred') to the actual values ('Y_validation'). Additionally, the confusion_matrix and classification_report functions from sklearn.metrics are used to generate a confusion matrix and a comprehensive classification report, respectively. These metrics provide insights into the model's predictive accuracy and misclassifications, aiding in assessing the model's performance in predicting income classes within the US Census adult income data analysis.

#--- Import accuracy_score, confusion_matrix, classification_report from sklearn.metrics ---


# --- WRITE YOUR CODE FOR MODULE 4 TASK 8 ---
accuracy= ...

cm= ...

cr= ...

#--- Inspect data ---
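
A sketch of the validation step follows. The data is synthetic, and a default LDA model is used purely as a stand-in for whichever model was selected in the earlier tasks.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real train/validation split (assumption).
X, y = make_classification(
    n_samples=200, n_features=8, n_informative=4, n_redundant=0, random_state=0
)
X_train, X_validation, Y_train, Y_validation = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Stand-in for the selected model (assumption).
model = LinearDiscriminantAnalysis().fit(X_train, Y_train)
y_pred = model.predict(X_validation)

accuracy = accuracy_score(Y_validation, y_pred)
cm = confusion_matrix(Y_validation, y_pred)       # rows: actual class, columns: predicted class
cr = classification_report(Y_validation, y_pred)  # per-class precision, recall, F1 as text

print(accuracy)
print(cm)
print(cr)
```

The diagonal of 'cm' counts correct predictions, so accuracy equals the sum of the diagonal divided by the total number of validation samples.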

### Task 9: Evaluating Precision, Recall, and F1-Score

This task calculates the precision, recall, and F1-score metrics for the model using the validation dataset. The precision_score, recall_score, and f1_score functions from sklearn.metrics are employed to compute precision, recall, and the F1-score, respectively. These metrics offer insights into the model's precision (accuracy of positive predictions), recall (sensitivity to actual positives), and the balance between precision and recall captured by the F1-score. These calculations further aid in understanding the model's performance in predicting income classes within the US Census adult income data analysis.

#--- Import precision_score, recall_score, f1_score from sklearn.metrics ---


# --- WRITE YOUR CODE FOR MODULE 4 TASK 9 ---
# Calculate precision
precision = ...

# Calculate recall
recall = ...

# Calculate F1-score
F1_score = ...


#precision,recall,F1_score

#--- Inspect data ---
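
To make the three metrics concrete, here is a small worked example on hand-written labels (not model output). With 3 true positives, 1 false positive, and 1 false negative, precision is 3 / (3 + 1) and recall is 3 / (3 + 1), and the F1-score is their harmonic mean.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hand-written labels for illustration: 1 = income >50K, 0 = income <=50K (assumed encoding).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 3 / 4
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 3 / 4
F1_score = f1_score(y_true, y_pred)          # 2 * P * R / (P + R) = 0.75

print(precision, recall, F1_score)
```

If the 'income' column still holds string labels rather than 0 and 1, these functions need to be told which class counts as positive through their `pos_label` argument.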

### Task 10: Predicting Income Class on New Data

This task uses the best-performing model from the prior evaluations to predict the income class for new data. The new data, structured as a one-row pandas DataFrame, holds an individual's features: age, fnlwgt, education.num, marital.status, sex, capital.gain, capital.loss, and hours.per.week. The column names and order match the features the model was trained on. Calling the 'predict' method of your selected model on the new data returns the predicted income class for that individual, showing how the model is applied to an unseen record in the context of the US Census adult income data analysis.

# --- WRITE YOUR CODE FOR MODULE 4 TASK 10 ---
#You need to predict the result by passing the sample data available here to your model to make a prediction.

new_data = pd.DataFrame([[74, 88638, 16, 0, 1, 0, 683, 20]],
                    columns=['age', 'fnlwgt', 'education.num', 'marital.status', 'sex', 'capital.gain', 'capital.loss', 'hours.per.week'])
prediction = ...

#--- Inspect data ---
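
The prediction step can be sketched end to end as follows. Everything except 'new_data' is an assumption made for illustration: the training rows are random, the target is a synthetic rule, and a default LDA model stands in for the model you actually selected, so the printed class says nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

columns = ["age", "fnlwgt", "education.num", "marital.status", "sex",
           "capital.gain", "capital.loss", "hours.per.week"]

# Random stand-in training data with the same columns (assumption).
rng = np.random.default_rng(0)
n = 200
X_train = pd.DataFrame({
    "age": rng.integers(17, 90, n),
    "fnlwgt": rng.integers(20000, 400000, n),
    "education.num": rng.integers(1, 17, n),
    "marital.status": rng.integers(0, 2, n),
    "sex": rng.integers(0, 2, n),
    "capital.gain": rng.integers(0, 5000, n),
    "capital.loss": rng.integers(0, 2000, n),
    "hours.per.week": rng.integers(1, 99, n),
})[columns]

# Synthetic target rule, used only so the model has two classes to learn (assumption).
Y_train = (X_train["education.num"] > 10).astype(int)

# Stand-in for the selected model (assumption).
model = LinearDiscriminantAnalysis().fit(X_train, Y_train)

# The sample record from the task cell.
new_data = pd.DataFrame([[74, 88638, 16, 0, 1, 0, 683, 20]], columns=columns)

prediction = model.predict(new_data)  # array with one predicted class label
print(prediction)
```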