# COGS 118A - Project Checkpoint

# Names

- Kyle Nakai
- Yueshan Huang
- Shaoming Chen

# Abstract 
In today's competitive job market, estimating future salary matters to both college students and educational institutions. Using machine learning techniques, our goal is to develop a predictive model that estimates the salary of college graduates from their academic achievements. We use a Kaggle dataset whose features reflect performance in college, such as GPA, degree, and AMCAT scores. Most features are numeric; those that are not, such as degree or gender, will be one-hot encoded. With this data, we will identify the 10 features that contribute most to salary and train a linear regression model on them. Model performance will be measured by mean squared error (MSE).

# Background

College students' salaries are influenced by various factors, including gender, college rank, major choice, and standardized test scores such as the AMCAT. Understanding the relationship between these factors and salary outcomes is crucial for informed decision-making and addressing potential disparities.

Gender plays<a name="hess"></a>[<sup>[1]</sup>](#hessnote) a significant role in salary discrepancies among college graduates. Despite advancements in gender equality, women often face lower wages compared to men. Analyzing salary data can provide insights into the extent of these disparities, helping policymakers develop strategies to promote pay equity and fair employment practices.

The prestige of the college<a name="pic"></a>[<sup>[2]</sup>](#picnote) attended also impacts salary prospects. Higher-ranked colleges often offer better resources, networking opportunities, and career services, which can translate into higher starting salaries. Exploring the connection between college rank and salaries can inform educational institutions about the value of enhancing career support for students and strengthening alumni networks.

The choice of major is another influential factor. Certain fields, such as STEM (Science, Technology, Engineering, and Mathematics) disciplines, typically yield higher-paying job opportunities. Evaluating the correlation between major choice and salary outcomes can guide students in making informed educational and career decisions while enabling educators to identify areas for curriculum enhancement and support in fields with lower earning potential.

Additionally, scores on standardized tests like the AMCAT can impact salary prospects. The AMCAT measures job applicants' aptitude and skills, serving as a benchmark for employability. Analyzing the relationship between AMCAT scores and salaries can help students understand how much test performance may matter for future earnings.

Machine learning techniques<a name="Mar"></a>[<sup>[3]</sup>](#Marnote) can effectively analyze relevant datasets to derive valuable insights. Predictive modeling can forecast salary outcomes based on factors such as gender, college rank, major, and AMCAT scores. Feature importance analysis can identify the relative significance of these factors in salary determination. Machine learning algorithms can also be employed for clustering and segmentation, helping identify patterns and target specific groups for interventions. By leveraging machine learning, policymakers and educators can gain actionable information to promote fair salary practices, address disparities, and support informed decision-making for college students.

# Problem Statement

In this project we want to see if we can predict the salary of recently graduated engineers based on data such as their college GPA, gender, college tier, 12th grade marks, and other variables. We hope to train a regression model on our dataset and be able to accurately estimate salary.

# Data

- Dataset: https://www.kaggle.com/datasets/manishkc06/engineering-graduate-salary-prediction?resource=download 
- 2998 observations and 34 variables
- Variables:
    - ID: A unique ID to identify a candidate
    - Salary: Annual CTC offered to the candidate (in INR)
    - Gender: Candidate's gender
    - DOB: Date of birth of the candidate
    - 10percentage: Overall marks obtained in grade 10 examinations
    - 10board: The school board whose curriculum the candidate followed in grade 10
    - 12graduation: Year of graduation - senior year high school
    - 12percentage: Overall marks obtained in grade 12 examinations
    - 12board: The school board whose curriculum the candidate followed in grade 12
    - CollegeID: Unique ID identifying the university/college which the candidate attended for her/his undergraduate studies
    - CollegeTier: Each college has been annotated as 1 or 2. The annotations have been computed from the average AMCAT scores obtained by the students in the college/university. Colleges with an average score above a threshold are tagged as 1 and others as 2.
    - Degree: Degree obtained/pursued by the candidate
    - Specialization: Specialization pursued by the candidate
    - CollegeGPA: Aggregate GPA at graduation
    - CollegeCityID: A unique ID to identify the city in which the college is located
    - CollegeCityTier: The tier of the city in which the college is located, annotated based on the population of the cities
    - CollegeState: Name of the state in which the college is located
    - GraduationYear: Year of graduation (Bachelor's degree)
    - English: Scores in AMCAT English section
    - Logical: Score in AMCAT Logical ability section
    - Quant: Score in AMCAT's Quantitative ability section
    - Domain: Scores in AMCAT's domain module
    - ComputerProgramming: Score in AMCAT's Computer programming section
    - ElectronicsAndSemicon: Score in AMCAT's Electronics & Semiconductor Engineering section
    - ComputerScience: Score in AMCAT's Computer Science section
    - MechanicalEngg: Score in AMCAT's Mechanical Engineering section
    - ElectricalEngg: Score in AMCAT's Electrical Engineering section
    - TelecomEngg: Score in AMCAT's Telecommunication Engineering section
    - CivilEngg: Score in AMCAT's Civil Engineering section
    - conscientiousness: Scores in one of the sections of AMCAT's personality test
    - agreeableness: Scores in one of the sections of AMCAT's personality test
    - extraversion: Scores in one of the sections of AMCAT's personality test
    - nueroticism: Scores in one of the sections of AMCAT's personality test
    - openess_to_experience: Scores in one of the sections of AMCAT's personality test

- We are particularly interested in the score variables such as CollegeGPA and 12percentage which are represented as floats
- We will need to clean some of the data in the following ways:
    - Convert categorical variables to one-hot encoding
    - Normalize the numerical values for accurate comparison
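
The two cleaning steps above could be sketched as follows. This is a minimal illustration on a tiny made-up frame: only the column names (`Gender`, `Degree`, `CollegeGPA`, `12percentage`) come from the variable list, the values are invented, and z-score standardization is an assumption, since the text does not say which normalization will be used.

```python
import pandas as pd

# Toy rows with invented values; only the column names come from the dataset.
df = pd.DataFrame({
    "Gender": ["m", "f", "f", "m"],
    "Degree": ["B.Tech/B.E.", "MCA", "B.Tech/B.E.", "B.Tech/B.E."],
    "CollegeGPA": [72.0, 80.5, 65.0, 90.5],
    "12percentage": [68.0, 85.0, 74.0, 91.0],
})

categorical_cols = ["Gender", "Degree"]
numeric_cols = ["CollegeGPA", "12percentage"]

# One-hot encode: each category becomes its own 0/1 indicator column.
encoded = pd.get_dummies(df, columns=categorical_cols, dtype=int)

# Standardize numeric columns to zero mean and unit variance (z-scores).
# ddof=0 uses the population standard deviation.
means = encoded[numeric_cols].mean()
stds = encoded[numeric_cols].std(ddof=0)
encoded[numeric_cols] = (encoded[numeric_cols] - means) / stds

print(encoded.columns.tolist())
```

In a full pipeline, the means and standard deviations would be computed on the training split only and then reused on validation data, so that no information leaks from the held-out samples.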

# Proposed Solution

We will train a regression model to estimate a graduate's salary from the features in the dataset. Because salary is a continuous value and our metric is MSE, this is a regression task rather than multi-class classification, so we will start with linear regression as the baseline. Categorical variables will be included in the analysis through one-hot encoding. MSE is the main evaluation metric: under k-fold cross-validation, the mean MSE across folds summarizes the overall performance of the model. Other algorithms can also be applied to this dataset, such as K-nearest neighbors, decision trees, and SVMs, to see whether they generalize better than the baseline.
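
One way the k-fold evaluation described above might look with scikit-learn is sketched below. Because salary is a continuous target scored with MSE, the sketch uses regression variants of the models named above (`SVR` for the SVM, for example). Everything specific here is an assumption for illustration: the data is synthetic rather than the Kaggle set, 5 folds are used, and `SelectKBest` with `f_regression` stands in for whatever top-10 feature selection the project ends up using.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the encoded feature matrix and the Salary column.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
true_weights = np.zeros(20)
true_weights[:5] = [3.0, -2.0, 1.5, 1.0, -0.5]
y = X @ true_weights + rng.normal(scale=0.5, size=200)

models = {
    "linear": LinearRegression(),
    "knn": KNeighborsRegressor(n_neighbors=5),
    "tree": DecisionTreeRegressor(random_state=0),
    "svr": SVR(),
}

cv = KFold(n_splits=5, shuffle=True, random_state=0)
mean_mse = {}
for name, model in models.items():
    # Scaling and feature selection sit inside the pipeline so they are
    # refit on the training folds only.
    pipeline = make_pipeline(
        StandardScaler(), SelectKBest(f_regression, k=10), model
    )
    # scikit-learn reports negated MSE so that larger is better; flip the sign.
    scores = cross_val_score(
        pipeline, X, y, cv=cv, scoring="neg_mean_squared_error"
    )
    mean_mse[name] = -scores.mean()

for name, mse in sorted(mean_mse.items(), key=lambda item: item[1]):
    print(f"{name}: mean CV MSE = {mse:.3f}")
```

Keeping the preprocessing inside the pipeline means each fold's held-out data never influences the scaler or the selected features, which keeps the reported MSE honest.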

# Evaluation Metrics


Our main evaluation metric is mean squared error (MSE), the average squared difference between predicted and actual salary. We will compute it under k-fold cross-validation and report the mean MSE across folds as the overall performance of each model.
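
For reference, the MSE named as the main metric in the abstract and proposed solution has the standard definition below. The symbols are introduced here for illustration: $n$ is the number of graduates in the evaluation set, $y_i$ the actual salary, $\hat{y}_i$ the predicted salary, $k$ the number of cross-validation folds, and $\mathrm{MSE}_j$ the error on held-out fold $j$.

$$
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2,
\qquad
\overline{\mathrm{MSE}}_{\mathrm{CV}} = \frac{1}{k}\sum_{j=1}^{k}\mathrm{MSE}_j
$$

Because salary is recorded in INR, MSE is in squared INR; taking its square root (RMSE) gives an error in the same units as salary, which can be easier to interpret.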

# Preliminary results



# Ethics & Privacy


This dataset is publicly available thanks to Aspiring Minds Research and provides income and educational details about recent graduates from Indian engineering and technology institutions. Privacy and exposure are always concerns when working with data collected from individuals. In this case, the dataset contains little personally identifiable information, so there is less risk of people being targeted and harmed. The only PII is date of birth, which is of limited value without other identifying information. We do, however, work with participants' actual income and educational information such as GPA and exam scores. The anonymity of the samples should protect participants in this regard, but we must still handle the data with care. We must also consider potential biases in the data arising from factors such as gender, age, and location. We will address privacy concerns by ensuring the data is used solely for the purposes outlined in our project.

# Team Expectations 

* *Respect each other*
* *Finish the work and join meetings on time*
* *Support each other, give help if you can*
* *Check messages on Discord in a timely manner*
* *Be cheerful, positive and encouraging to other team members*

# Project Timeline Proposal

| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 5/17  |  8 PM |  Finish rough draft of project proposal  | Finalize the project proposal and make sure to submit on time |
| 5/24  |  8 PM |  Read the feedback from TA and peers | Discuss the feedback, think of ways to fix the issues, split the work accordingly |
| 5/29  |  8 PM | Finish individual work split  | Review each other's work and discuss how to improve further, split the work accordingly |
| 5/31  |  8 PM | Fix the issues and finish rough draft of project checkpoint | Finalize the project checkpoint and make sure to submit on time |
| 6/08   |  8 PM  | Think of ways to improve our project and which parts aren't satisfying | Share thoughts and discuss the feasibility of each group member's ideas, split the work accordingly |
| 6/12  |  8 PM  | Finish individual work split | Discuss how to fix and improve our project |
| 6/14  |  8 PM  | Finish the final project | Turn in Final Project  |

# Footnotes
<a name="hessnote"></a>1.[^](#hess) Hess, A. (Jan 20, 2022) Survey of 563,000 recent college grads finds gender pay gap already impacting class of 2020. https://www.cnbc.com/2022/01/20/gender-pay-gap-for-class-of-2020-starting-salaries-shown-in-new-report.html <br>
<a name="picnote"></a>2.[^](#pic) Picchi, A. (March 2, 2023) Your college major can influence your pay. Here are the best and worst majors. https://www.cbsnews.com/news/college-major-highest-lowest-incomes/ <br>
<a name="Marnote"></a>3.[^](#Mar) Martin, N. (July 2018) Salary Prediction in the IT Job Market with Few High-Dimensional Samples: A Spanish Case Study. International Journal of Computational Intelligence Systems, 11(1):1192.
