# COGS 118A - Final Project

# Insert title here

## Group members

- Shaoming Chen
- Yueshan Huang
- Kyle Nakai

# Abstract 
In today’s competitive job market, estimating future salary matters to both college students and educational institutions. Using machine learning techniques, our goal is to develop a predictive model that estimates the salary of college graduates from their academic achievements. We used a Kaggle dataset whose features reflect performance in college, such as GPA, degree, and AMCAT scores. Most features are numeric; those that are not, such as degree or gender, will be one-hot encoded. With this data, we will identify the 10 features that contribute most to salary and train a linear regression model on them. Our model's performance will be measured by mean squared error (MSE).







# Background

College students' salaries are influenced by various factors, including gender, college rank, major choice, and standardized test scores such as the AMCAT. Understanding the relationship between these factors and salary outcomes is crucial for informed decision-making and addressing potential disparities.

Gender plays<a name="hess"></a>[<sup>[1]</sup>](#hessnote) a significant role in salary discrepancies among college graduates. Despite advancements in gender equality, women often face lower wages compared to men. Analyzing salary data can provide insights into the extent of these disparities, helping policymakers develop strategies to promote pay equity and fair employment practices.

The prestige of the college<a name="pic"></a>[<sup>[2]</sup>](#picnote) attended also impacts salary prospects. Higher-ranked colleges often offer better resources, networking opportunities, and career services, which can translate into higher starting salaries. Exploring the connection between college rank and salaries can inform educational institutions about the value of enhancing career support for students and strengthening alumni networks.

The choice of major is another influential factor. Certain fields, such as STEM (Science, Technology, Engineering, and Mathematics) disciplines, typically yield higher-paying job opportunities. Evaluating the correlation between major choice and salary outcomes can guide students in making informed educational and career decisions while enabling educators to identify areas for curriculum enhancement and support in fields with lower earning potentials.

Additionally, standardized tests like the AMCAT score can impact salary prospects. The AMCAT measures job applicants' aptitude and skills, serving as a benchmark for employability. Analyzing the relationship between AMCAT scores and salaries can help students understand the potential significance of test performance on future earnings.

Machine learning techniques<a name="Mar"></a>[<sup>[3]</sup>](#Marnote) can effectively analyze relevant datasets to derive valuable insights. Predictive modeling can forecast salary outcomes based on factors such as gender, college rank, major, and AMCAT scores. Feature importance analysis can identify the relative significance of these factors in salary determination. Machine learning algorithms can also be employed for clustering and segmentation, helping identify patterns and target specific groups for interventions. By leveraging machine learning, policymakers and educators can gain actionable information to promote fair salary practices, address disparities, and support informed decision-making for college students.

# Problem Statement
In this project, we address the problem of predicting the salary of recently graduated engineers. Our research question is how well salary can be predicted from variables including college GPA, gender, college tier, 12th-grade marks, and potentially other relevant factors. Our objective is to develop a linear regression model that takes these factors as input and accurately estimates an individual's salary as output.

The variables involved in the study, such as college GPA, gender, college tier, and 12th-grade marks, are clearly defined in the dataset, which makes the problem quantifiable. The problem is also replicable: linear regression is a well-established and widely used statistical technique, and k-fold cross-validation can be applied to other datasets with similar columns. Evaluating on multiple validation folds gives a more reliable estimate of the model's performance than a single split.
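
To make the quantifiable part concrete, the regression problem can be written out as below. This is only a sketch of the standard linear regression setup; the symbols ($d$ input features $x_{ij}$, weights $w_j$, intercept $w_0$) are notation assumed here for illustration and do not come from the dataset.

$$\hat{y}_i = w_0 + \sum_{j=1}^{d} w_j x_{ij}, \qquad \min_{w_0, \ldots, w_d} \; \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

Here $y_i$ is the actual salary of graduate $i$, $\hat{y}_i$ is the predicted salary, and $n$ is the number of graduates used for fitting.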





# Data

- Dataset: https://www.kaggle.com/datasets/manishkc06/engineering-graduate-salary-prediction?resource=download 
- 2998 observations and 34 variables
- Variables:
    - ID: A unique ID to identify a candidate
    - Salary: Annual CTC offered to the candidate (in INR)
    - Gender: Candidate's gender
    - DOB: Date of birth of the candidate
    - 10percentage: Overall marks obtained in grade 10 examinations
    - 10board: The school board whose curriculum the candidate followed in grade 10
    - 12graduation: Year of graduation - senior year high school
    - 12percentage: Overall marks obtained in grade 12 examinations
    - 12board: The school board whose curriculum the candidate followed in grade 12
    - CollegeID: Unique ID identifying the university/college which the candidate attended for her/his undergraduate
    - CollegeTier: Each college has been annotated as 1 or 2. The annotations have been computed from the average AMCAT scores obtained by the students in the college/university. Colleges with an average score above a threshold are tagged as 1 and others as 2.
    - Degree: Degree obtained/pursued by the candidate
    - Specialization: Specialization pursued by the candidate
    - CollegeGPA: Aggregate GPA at graduation
    - CollegeCityID: A unique ID to identify the city in which the college is located
    - CollegeCityTier: The tier of the city in which the college is located, annotated based on the population of the city
    - CollegeState: Name of the state in which the college is located
    - GraduationYear: Year of graduation (Bachelor's degree)
    - English: Scores in AMCAT English section
    - Logical: Score in AMCAT Logical ability section
    - Quant: Score in AMCAT's Quantitative ability section
    - Domain: Scores in AMCAT's domain module
    - ComputerProgramming: Score in AMCAT's Computer programming section
    - ElectronicsAndSemicon: Score in AMCAT's Electronics & Semiconductor Engineering section
    - ComputerScience: Score in AMCAT's Computer Science section
    - MechanicalEngg: Score in AMCAT's Mechanical Engineering section
    - ElectricalEngg: Score in AMCAT's Electrical Engineering section
    - TelecomEngg: Score in AMCAT's Telecommunication Engineering section
    - CivilEngg: Score in AMCAT's Civil Engineering section
    - conscientiousness: Scores in one of the sections of AMCAT's personality test
    - agreeableness: Scores in one of the sections of AMCAT's personality test
    - extraversion: Scores in one of the sections of AMCAT's personality test
    - nueroticism: Scores in one of the sections of AMCAT's personality test
    - openess_to_experience: Scores in one of the sections of AMCAT's personality test

- We are particularly interested in the score variables such as CollegeGPA and 12percentage which are represented as floats
- We will need to clean some of the data in the following ways:
    - Convert categorical variables to one-hot encoding
    - Normalize the numerical values for accurate comparison
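
The two cleaning steps listed above could look roughly like the following. This is a minimal sketch on a tiny made-up table, not the project's actual cleaning code: the four rows, the choice of columns, and the use of `StandardScaler` (zero mean, unit variance) as the normalization are all assumptions made for illustration.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Tiny made-up sample that mimics a few of the dataset's columns (values are invented).
df = pd.DataFrame({
    "Gender": ["m", "f", "m", "f"],
    "Degree": ["B.Tech/B.E.", "B.Tech/B.E.", "MCA", "B.Tech/B.E."],
    "CollegeGPA": [70.0, 82.5, 65.0, 90.0],
    "12percentage": [75.0, 88.0, 60.0, 92.0],
})

categorical_cols = ["Gender", "Degree"]
numeric_cols = ["CollegeGPA", "12percentage"]

# One-hot encode: each category value becomes its own 0/1 column.
encoded = pd.get_dummies(df, columns=categorical_cols, dtype=int)

# Standardize the numeric columns to zero mean and unit variance.
scaler = StandardScaler()
encoded[numeric_cols] = scaler.fit_transform(encoded[numeric_cols])

print(encoded.columns.tolist())
```

When the scaler is combined with cross-validation, it is usually fit on the training portion only, so that statistics from the validation data do not leak into training.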






# Proposed Solution

Python will be used to implement the solution. Several libraries will be used for processing the data and for training and evaluating the models, including scikit-learn, pandas, and NumPy.

The solution consists of several steps.

First, the raw data is cleaned through several pre-processing steps: encoding categorical variables, deleting extra columns, handling null values and outliers, and normalizing the numerical values.

Then we use a multivariable linear regression model to predict each student's salary. We use linear regression rather than logistic regression because the desired outcome is a numerical value rather than a binary variable. The goal is to derive an expression for salary in terms of the input factors given in the columns. Since our data includes more than one input column, salary cannot be expressed as a function of a single variable, so we need a multivariable expression for our model.

Apart from the traditional linear regression model, we also use a support vector machine (SVM) in its regression form, support vector regression (SVR). There are several reasons for this. SVR can fit a linear function to our data and handles multivariable problems. Its margin-based loss and regularization also make it relatively robust to overfitting and outliers, which helps it generalize to unseen data. Finally, it works well and trains efficiently on small datasets.
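
As an illustration of how the two models could be set up with scikit-learn, here is a minimal sketch. It uses synthetic data rather than the salary dataset, and it uses `SVR`, the regression form of SVM; the linear kernel and the values of `C` and `epsilon` are assumptions chosen for the example, not tuned settings.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for the cleaned feature matrix and the salary target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([3.0, -2.0, 0.5]) + 10.0 + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Both models share the same scaling step so the comparison is like for like.
linreg = make_pipeline(StandardScaler(), LinearRegression())
svr = make_pipeline(StandardScaler(), SVR(kernel="linear", C=1.0, epsilon=0.1))

linreg.fit(X_train, y_train)
svr.fit(X_train, y_train)

linreg_pred = linreg.predict(X_test)
svr_pred = svr.predict(X_test)
```

Wrapping the scaler and the model in one pipeline keeps the scaling statistics tied to whatever data the model was fit on.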

Finally, we use k-fold cross-validation to train and evaluate the model. We chose this method because it shuffles our data and makes the most of the available data, as our dataset is not large. Cross-validation also provides a more robust estimate of a model's performance than a single train-test split. By averaging the results from multiple iterations, it reduces the impact of the specific data points in any single split on the performance metrics, leading to a more representative estimate.
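
The way k-fold cross-validation uses all of the data can be seen in a small sketch. The ten made-up samples and the choice of five folds are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten made-up samples with two features each.
X = np.arange(20).reshape(10, 2)

# shuffle=True randomizes which samples land in which fold.
kf = KFold(n_splits=5, shuffle=True, random_state=0)

validation_counts = np.zeros(len(X), dtype=int)
fold_sizes = []
for train_idx, val_idx in kf.split(X):
    # Count how often each sample is held out for validation.
    validation_counts[val_idx] += 1
    fold_sizes.append((len(train_idx), len(val_idx)))

print(fold_sizes)
print(validation_counts)
```

Every sample is held out exactly once and used for training in the remaining folds, which is why the method suits a dataset of modest size.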







# Evaluation Metrics
In our problem, a reasonable evaluation metric is mean squared error (MSE), a common metric for evaluating the performance of a regression model. It measures the average squared difference between the model's predicted values and the actual values from the dataset:

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

Here $n$ is the total number of data points, $y_i$ is the actual value of the target variable for the $i$-th data point, and $\hat{y}_i$ is the predicted value for the $i$-th data point.

Since the MSE is never negative, a lower MSE indicates better performance, as the predicted values are closer to the actual values in the dataset. It measures how well the model fits the data. However, because the errors are squared, MSE penalizes larger errors more heavily than smaller ones.
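
A small worked example shows the effect of squaring. The numbers are made up: both sets of predictions have the same total absolute error of 30, but one spreads it over three points and the other concentrates it in a single point. The helper name `mse` is hypothetical.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: the average of the squared differences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

y_true = [300.0, 400.0, 500.0]
small_errors = [310.0, 390.0, 510.0]     # three errors of 10 each
one_large_error = [300.0, 400.0, 530.0]  # a single error of 30

mse_small = mse(y_true, small_errors)     # (100 + 100 + 100) / 3 = 100.0
mse_large = mse(y_true, one_large_error)  # (0 + 0 + 900) / 3 = 300.0
print(mse_small, mse_large)
```

The single large error produces three times the MSE of the three small ones, even though the total absolute error is identical.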

Moreover, k-fold cross-validation can be used along with MSE. By shuffling the data and evaluating the model k times, each time on a different held-out fold, we can detect overfitting and obtain a more reliable estimate of the model's error. The MSE values from the folds can be averaged to provide an overall assessment of the model's performance.
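
Averaging the per-fold MSE could be done along these lines with scikit-learn. This is a sketch on synthetic data, assuming five folds; note that `cross_val_score` reports the negated MSE under the `"neg_mean_squared_error"` scorer, so the sign is flipped before averaging.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the cleaned features and the salary target.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

cv = KFold(n_splits=5, shuffle=True, random_state=0)

# scikit-learn scorers follow "higher is better", hence the negated MSE.
neg_mse = cross_val_score(
    LinearRegression(), X, y, cv=cv, scoring="neg_mean_squared_error"
)
fold_mse = -neg_mse
mean_mse = fold_mse.mean()
print(fold_mse, mean_mse)
```

The spread of `fold_mse` across folds is also worth reporting, since a large spread suggests the estimate depends heavily on which samples are held out.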




# Results

You may have done tons of work on this. Not all of it belongs here. 

Reports should have a __narrative__. Once you've looked through all your results over the quarter, decide on one main point and 2-4 secondary points you want us to understand. Include the detailed code and analysis results of those points only; you should spend more time/code/plots on your main point than the others.

If you went down any blind alleys that you later decided to not pursue, please don't abuse the TAs' time by throwing in 81 lines of code and 4 plots related to something you actually abandoned.  Consider deleting things that are not important to your narrative.  If it's slightly relevant to the narrative or you just want us to know you tried something, you could keep it in by summarizing the result in this report in a sentence or two, moving the actual analysis to another file in your repo, and providing us a link to that file.

### Subsection 1

You will likely have different subsections as you go through your report. For instance you might start with an analysis of the dataset/problem and from there you might be able to draw out the kinds of algorithms that are / aren't appropriate to tackle the solution.  Or something else completely if this isn't the way your project works.

### Subsection 2

Another likely section is if you are doing any feature selection through cross-validation or hand-design/validation of features/transformations of the data

### Subsection 3

Probably you need to describe the base model and demonstrate its performance.  Maybe you include a learning curve to show whether you have enough data to do train/validate/test split or have to go to k-folds or LOOCV or ???

### Subsection 4

Perhaps some exploration of the model selection (hyper-parameters) or algorithm selection task. Validation curves, plots showing the variability of performance across folds of the cross-validation, etc. If you're doing one, the outcome of the null hypothesis test or parsimony principle check to show how you are selecting the best model.

### Subsection 5 

Maybe you do model selection again, but using a different kind of metric than before?



# Discussion

### Interpreting the result

OK, you've given us quite a bit of tech information above, now it's time to tell us what to pay attention to in all that.  Think clearly about your results, decide on one main point and 2-4 secondary points you want us to understand. Highlight HOW your results support those points.  You probably want 2-5 sentences per point.

### Limitations

Are there any problems with the work?  For instance would more data change the nature of the problem? Would it be good to explore more hyperparams than you had time for?   

### Ethics & Privacy

This dataset is publicly available thanks to Aspiring Minds Research and provides income and educational details about recent graduates from Indian engineering and technology institutions. There are always concerns about privacy and exposure when working with data collected from individuals. In this case, however, our dataset contains little personally identifiable information (PII), so there is less risk of people being targeted and harmed. The only PII is date of birth, which is of little value without other identifying information. We do work with the participants' actual income as well as educational information such as GPA and exam scores. The anonymity of the samples should protect the participants in this regard, but we must still be cautious when handling such data. We must also consider potential biases in the data that could arise from factors such as gender, age, location, and other circumstances. We will address potential privacy concerns by ensuring the data is used solely for the purposes outlined in our project.

### Conclusion

Reiterate your main point and in just a few sentences tell us how your results support it. Mention how this work would fit in the background/context of other work in this field if you can. Suggest directions for future work if you want to.

# Footnotes
<a name="hessnote"></a>1.[^](#hess) Hess, A. (January 20, 2022). Survey of 563,000 recent college grads finds gender pay gap already impacting class of 2020. CNBC. https://www.cnbc.com/2022/01/20/gender-pay-gap-for-class-of-2020-starting-salaries-shown-in-new-report.html <br>
<a name="picnote"></a>2.[^](#pic) Picchi, A. (March 2, 2023). Your college major can influence your pay. Here are the best and worst majors. CBS News. https://www.cbsnews.com/news/college-major-highest-lowest-incomes/ <br>
<a name="Marnote"></a>3.[^](#Mar) Martin, N. (July 2018). Salary Prediction in the IT Job Market with Few High-Dimensional Samples: A Spanish Case Study. International Journal of Computational Intelligence Systems, 11(1):1192.