<h4>I used Claude, Anthropic's model, to help with the code portion of this assignment.</h4>
<h4>I used Copilot for help understanding these concepts.</h4>

<h3>1) Collect data from a reputable source in order to use both a simple regression and multiple regression to make predictions and generate prediction intervals in a Colab notebook. You can collect multiple independent variables, or you can "make" the extra independent variable by using a time trend like the video, interactions, or categorical variables. Do not use simulated data for this assignment. Motivate your analysis by explaining why someone might be interested in predicting the dependent variable you use. Please reach out to me if you need to brainstorm data to collect. Come up with some interesting sets of independent variables to make predictions with, and present both the point prediction and associated prediction interval.</h3>



<h3>(2) Make a plot showing the predicted regression line from your simple regression with the dependent variable on the y-axis and independent variable on the x-axis. Also display prediction interval lines. (The easiest way to do this is probably starting by generating x-values from x-min to x-max, then getting predicted y's for that set. Then generate lower and upper prediction interval bounds for each x in the set. Note that the statistics, e.g., \bar{x}, should still be calculated using the collected data.)</h3>
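<p>The grid approach described in the prompt can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the function name <code>prediction_band</code> and its parameters are hypothetical, and it assumes the textbook simple-regression prediction interval with n - 2 degrees of freedom, computed with NumPy and SciPy.</p>

```python
import numpy as np
from scipy import stats


def prediction_band(x, y, alpha=0.05, n_grid=100):
    """Fit y = b0 + b1*x by OLS and return a grid with prediction-interval bounds.

    Hypothetical helper for illustration; x and y are the collected data.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size

    # Statistics such as x-bar come from the collected data, not the plotting grid.
    x_bar = x.mean()
    s_xx = np.sum((x - x_bar) ** 2)
    b1 = np.sum((x - x_bar) * (y - y.mean())) / s_xx
    b0 = y.mean() - b1 * x_bar

    # Residual standard error with n - 2 degrees of freedom.
    resid = y - (b0 + b1 * x)
    s = np.sqrt(np.sum(resid ** 2) / (n - 2))

    # Evenly spaced x-values from x-min to x-max, and their predicted y's.
    x_grid = np.linspace(x.min(), x.max(), n_grid)
    y_hat = b0 + b1 * x_grid

    # Standard error for a new observation at each grid point.
    se_pred = s * np.sqrt(1 + 1 / n + (x_grid - x_bar) ** 2 / s_xx)
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)

    return x_grid, y_hat, y_hat - t_crit * se_pred, y_hat + t_crit * se_pred
```

<p>Plotting the returned grid against the predicted values and the two bounds gives the regression line with its prediction band; the band is narrowest near the sample mean of x.</p>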

<p>I found a dataset that a Kaggle user put together by scraping online forms for job postings related to data science.</p>
<a href="https://www.kaggle.com/datasets/elahehgolrokh/data-science-job-postings-with-salaries-2025">2025 Data Science Job Postings</a>

<ul>
<li>job_title → The main keyword in the title of the posted position (e.g., Data Scientist, ML Engineer).
</li>
<li>seniority_level → Seniority of the role (e.g., Junior, Senior, Lead).
</li>
<li>status → Work arrangement type (Remote, Hybrid, On-site).
</li>
<li>company → Anonymized company identifier.
</li>
<li>location → Job location(s) mentioned in the posting.
</li>
<li>post_date → When the job posting was listed.
</li>
<li>headquarter → Location of the company’s headquarters.
</li>
<li>industry → Industry sector of the company (e.g., Finance, Technology).
</li>
<li>ownership → Ownership type (e.g., Public, Private).
</li>
<li>company_size → Number of employees in the company.
</li>
<li>revenue → Reported revenue of the company (if available).
</li>
<li>salary → Annual salary in euros, expressed as a min-max range or a single value.
</li>
<li>skills → Extracted list of required or preferred skills (e.g., Python, SQL, Spark).</li>
</ul>

In [17]:
import pandas as pd
import re

In [47]:
df = pd.read_csv('../datasets/assignment8/data_science_job_posts_2025.csv')
df_features = df.columns

In [48]:
# Sample exchange rate (you can update this dynamically)
EUR_TO_USD = 1.17

def convert_revenue(value):
    value = str(value)  # Ensure it's a string
    match = re.match(r'€([\d\.]+)([MBT])', value)
    if match:
        num, unit = match.groups()
        multiplier = {'M': 1e6, 'B': 1e9, 'T': 1e12}[unit]
        usd_value = float(num) * multiplier * EUR_TO_USD
        return f"${usd_value:,.2f}"
    return value  # Return original if not matching

def convert_salary(value):
    value = str(value)  # Ensure it's a string
    parts = re.findall(r'€[\d,]+', value)
    if parts:
        converted = []
        for part in parts:
            num = float(part.replace('€', '').replace(',', ''))
            usd = num * EUR_TO_USD
            converted.append(f"${usd:,.0f}")
        return " - ".join(converted)
    return value
# def clean_company_size(value):
#     value = str(value)  # Ensure it's a string
#     match = re.search(r'\d+', value.replace(',', ''))
#     return int(match.group()) if match else None


In [None]:
df['revenue'] = df['revenue'].fillna("").apply(convert_revenue)
df['salary'] = df['salary'].fillna("").apply(convert_salary)
# Keep only rows whose company_size consists of digits and commas. This drops 40 rows
# where company_size contained stray euro signs that I chose not to clean with a regex.
df = df[df['company_size'].astype(str).str.match(r'^[\d,]+$')]

In [53]:
df.head(5)

Unnamed: 0,job_title,seniority_level,status,company,location,post_date,headquarter,industry,ownership,company_size,revenue,salary,skills
1,data scientist,lead,hybrid,company_005,"Fort Worth, TX . Hybrid",15 days ago,"Detroit, MI, US",Manufacturing,Public,155030,"$59,787,000,000.00","$138,918","['spark', 'r', 'python', 'sql', 'machine learn..."
2,data scientist,senior,on-site,company_007,"Austin, TX . Toronto, Ontario, Canada . Kirkla...",a month ago,"Redwood City, CA, US",Technology,Public,25930,"$39,546,000,000.00","$111,135 - $186,684","['aws', 'git', 'python', 'docker', 'sql', 'mac..."
3,data scientist,senior,hybrid,company_008,"Chicago, IL . Scottsdale, AZ . Austin, TX . Hy...",8 days ago,"San Jose, CA, US",Technology,Public,34690,"$95,600,700,000.00","$131,972 - $227,450","['sql', 'r', 'python']"
4,data scientist,,on-site,company_009,On-site,3 days ago,"Stamford, CT, US",Finance,Private,1800,Private,"$133,581 - $267,154",[]
5,data scientist,lead,,company_013,"New York, NY",3 months ago,"New York, NY, US",Technology,Private,150,"$2,527,200,000.00","$229,754 - $293,869","['scikit-learn', 'python', 'scala', 'sql', 'ma..."
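<p>After the conversion, salary is still a string such as "$111,135 - $186,684", so it needs a numeric form before it can be used as a dependent variable. One possible approach, sketched below, is to take the midpoint of the range. The helper name <code>salary_midpoint</code> and the choice of the midpoint are assumptions for illustration, not necessarily what the rest of this notebook does.</p>

```python
import re


def salary_midpoint(value):
    """Return the midpoint of a '$min - $max' string, the single value, or None.

    Hypothetical helper: assumes dollar amounts formatted like '$111,135'.
    """
    # Pull out every dollar amount and strip the thousands separators.
    amounts = [float(m.replace(",", "")) for m in re.findall(r"\$([\d,]+)", str(value))]
    if not amounts:
        return None  # Empty or unparseable salary.
    # One amount returns itself; two amounts return their average.
    return sum(amounts) / len(amounts)
```

<p>It could be applied with something like <code>df['salary'].apply(salary_midpoint)</code>, storing the result in a new numeric column.</p>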


In [54]:
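# Note: without parentheses, this displays the bound method and the DataFrame repr;
# df.count() would return per-column non-null counts instead.
df.count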

<bound method DataFrame.count of                      job_title seniority_level   status      company  \
1               data scientist            lead   hybrid  company_005   
2               data scientist          senior  on-site  company_007   
3               data scientist          senior   hybrid  company_008   
4               data scientist             NaN  on-site  company_009   
5               data scientist            lead      NaN  company_013   
..                         ...             ...      ...          ...   
939             data scientist          senior      NaN  company_171   
940  machine learning engineer          senior      NaN  company_134   
941             data scientist        midlevel  on-site  company_395   
942             data scientist        midlevel  on-site  company_395   
943             data scientist          senior  on-site  company_844   

                                              location     post_date  \
1                             