<a href="https://colab.research.google.com/github/Niranjana-08/AI-Ascent/blob/main/notebooks/data_cleaning/data_cleaning_02.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Notebook Objective:**

*   This notebook takes the final merged job dataset, applies several cleaning and pre-processing steps, and prepares it for machine learning classification.
*   The primary goal is to create a single, clean text column (combined_text) that consolidates the job title, skills, and description. This column is later matched with keywords_mega.py.

## Mounting Drive and Loading Data

In [None]:
import pandas as pd
from google.colab import drive
drive.mount('/content/drive')

file_path = '/content/drive/My Drive/job-analysis/job-analysis-dataset/final_merged_jobs.csv'
final_df = pd.read_csv(file_path)

print("Dataset loaded successfully!")
final_df.head()

Mounting Google Drive...
Mounted at /content/drive
Loading the final merged dataset...
Dataset loaded successfully!


,job_id,title,description,max_salary,med_salary,min_salary,pay_period,location,company_id,formatted_work_type,formatted_experience_level,skills_desc,work_type,currency,compensation_type,company_name,company_size,speciality,employee_count,skill_name
0,921716,Marketing Coordinator,Job descriptionA leading real estate firm in N...,20.0,,17.0,HOURLY,"Princeton, NJ",2774458.0,Full-time,,Requirements: \n\nWe are seeking a College or ...,FULL_TIME,USD,BASE_SALARY,Corcoran Sawyer Smith,2.0,"['real estate', 'new development']",402.0,"['Marketing', 'Sales']"
1,1829192,Mental Health Therapist/Counselor,"At Aspen Therapy and Wellness , we are committ...",50.0,,30.0,HOURLY,"Fort Collins, CO",,Full-time,,,FULL_TIME,USD,BASE_SALARY,,,[],,['Health Care Provider']
2,10998357,Assitant Restaurant Manager,The National Exemplar is accepting application...,65000.0,,45000.0,YEARLY,"Cincinnati, OH",64896719.0,Full-time,,We are currently accepting resumes for FOH - A...,FULL_TIME,USD,BASE_SALARY,The National Exemplar,1.0,[],15.0,"['Management', 'Manufacturing']"
3,23221523,Senior Elder Law / Trusts and Estates Associat...,Senior Associate Attorney - Elder Law / Trusts...,175000.0,,140000.0,YEARLY,"New Hyde Park, NY",766262.0,Full-time,,This position requires a baseline understandin...,FULL_TIME,USD,BASE_SALARY,"Abrams Fensterman, LLP",2.0,"['Civil Litigation', 'Corporate & Securities L...",222.0,['Other']
4,35982263,Service Technician,Looking for HVAC service tech with experience ...,80000.0,,60000.0,YEARLY,"Burlington, IA",,Full-time,,,FULL_TIME,USD,BASE_SALARY,,,[],,['Information Technology']


In [None]:
column_names = final_df.columns.tolist()
print(column_names)

['job_id', 'title', 'description', 'max_salary', 'med_salary', 'min_salary', 'pay_period', 'location', 'company_id', 'formatted_work_type', 'formatted_experience_level', 'skills_desc', 'work_type', 'currency', 'compensation_type', 'company_name', 'company_size', 'speciality', 'employee_count', 'skill_name']


## Processing the 'skill_name' Column

In [None]:
# Inspect the Python type of a single value in the column
print(type(final_df['skill_name'].iloc[0]))

<class 'str'>


In [None]:
print(final_df['skill_name'])

0                            ['Marketing', 'Sales']
1                          ['Health Care Provider']
2                   ['Management', 'Manufacturing']
3                                         ['Other']
4                        ['Information Technology']
                            ...                    
123844            ['Legal', 'Business Development']
123845    ['Engineering', 'Information Technology']
123846            ['Sales', 'Business Development']
123847            ['Business Development', 'Sales']
123848                                ['Marketing']
Name: skill_name, Length: 123849, dtype: object


## Converting Skill Strings to Clean Text

In [None]:
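import ast

# Each value is a string such as "['Marketing', 'Sales']"; parse it into a
# real list, then join the items into plain comma-separated text.
final_df['cleaned_skills'] = final_df['skill_name'].apply(
    lambda x: ', '.join(ast.literal_eval(x))
)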

print(final_df['cleaned_skills'])


0                            Marketing, Sales
1                        Health Care Provider
2                   Management, Manufacturing
3                                       Other
4                      Information Technology
                         ...                 
123844            Legal, Business Development
123845    Engineering, Information Technology
123846            Sales, Business Development
123847            Business Development, Sales
123848                              Marketing
Name: cleaned_skills, Length: 123849, dtype: object
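
The conversion above assumes every skill_name value is a well-formed list literal. As a minimal sketch of a more defensive parser, assuming that missing or malformed values should simply become an empty string (the helper name `skills_to_text` is hypothetical and not part of the original notebook):

```python
import ast

def skills_to_text(value):
    """Turn "['A', 'B']" into "A, B"; return '' for anything unparseable."""
    if not isinstance(value, str):
        return ""  # e.g. NaN (a float) for rows with no skills
    try:
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError):
        return ""  # not a valid Python literal
    if not isinstance(parsed, (list, tuple)):
        return ""  # a literal, but not a list of skills
    return ", ".join(str(item) for item in parsed)

print(skills_to_text("['Marketing', 'Sales']"))  # Marketing, Sales
```

It could be applied with `final_df['skill_name'].apply(skills_to_text)` in place of the lambda, which may be preferable if the column is not guaranteed to be fully populated.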


In [None]:
final_df.head()

,job_id,title,description,max_salary,med_salary,min_salary,pay_period,location,company_id,formatted_work_type,...,skills_desc,work_type,currency,compensation_type,company_name,company_size,speciality,employee_count,skill_name,cleaned_skills
0,921716,Marketing Coordinator,Job descriptionA leading real estate firm in N...,20.0,,17.0,HOURLY,"Princeton, NJ",2774458.0,Full-time,...,Requirements: \n\nWe are seeking a College or ...,FULL_TIME,USD,BASE_SALARY,Corcoran Sawyer Smith,2.0,"['real estate', 'new development']",402.0,"['Marketing', 'Sales']","Marketing, Sales"
1,1829192,Mental Health Therapist/Counselor,"At Aspen Therapy and Wellness , we are committ...",50.0,,30.0,HOURLY,"Fort Collins, CO",,Full-time,...,,FULL_TIME,USD,BASE_SALARY,,,[],,['Health Care Provider'],Health Care Provider
2,10998357,Assitant Restaurant Manager,The National Exemplar is accepting application...,65000.0,,45000.0,YEARLY,"Cincinnati, OH",64896719.0,Full-time,...,We are currently accepting resumes for FOH - A...,FULL_TIME,USD,BASE_SALARY,The National Exemplar,1.0,[],15.0,"['Management', 'Manufacturing']","Management, Manufacturing"
3,23221523,Senior Elder Law / Trusts and Estates Associat...,Senior Associate Attorney - Elder Law / Trusts...,175000.0,,140000.0,YEARLY,"New Hyde Park, NY",766262.0,Full-time,...,This position requires a baseline understandin...,FULL_TIME,USD,BASE_SALARY,"Abrams Fensterman, LLP",2.0,"['Civil Litigation', 'Corporate & Securities L...",222.0,['Other'],Other
4,35982263,Service Technician,Looking for HVAC service tech with experience ...,80000.0,,60000.0,YEARLY,"Burlington, IA",,Full-time,...,,FULL_TIME,USD,BASE_SALARY,,,[],,['Information Technology'],Information Technology


## Advanced Text Cleaning with NLTK

This is the core of our pre-processing pipeline. We will define a robust function to clean our text data (title, cleaned_skills, description) and then combine them into a single feature for our classification model.

In [None]:
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

In [None]:
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [None]:
df_processed = final_df.copy()
print("DataFrame copied successfully.")

DataFrame copied successfully.


In [None]:
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

### Defining the Text Cleaning Function

In [None]:
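def clean_text_nltk(text):
    """Lowercase, keep letters only, drop stopwords, and lemmatize."""
    text = str(text).lower()
    # Replace anything that is not a letter or whitespace with a space.
    text = re.sub(r'[^a-z\s]', ' ', text)
    words = text.split()
    # Drop stopwords, then lemmatize (the default part of speech is noun).
    lemmatized_words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
    # split() followed by join() already collapses repeated whitespace.
    return ' '.join(lemmatized_words)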

The function performs several key text normalization steps:

1. Lowercase: Converts all text to lowercase for consistency.

2. Remove Symbols: Replaces every character that is not a letter or whitespace (digits, punctuation) with a space.

3. Remove Stopwords: Removes common, non-informative English words.

4. Lemmatize: Reduces words to their dictionary form. Because the lemmatizer is called without a part-of-speech tag, it treats every word as a noun, so plurals are reduced (e.g., "agents" becomes "agent") while verb forms such as "working" are left unchanged.
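
To make the first three steps concrete, here is a small sketch that returns the intermediate result of each stage. It deliberately uses a tiny, hypothetical stop-word set (`DEMO_STOP_WORDS`) instead of NLTK's full English list and leaves out lemmatization, so it runs without any downloads; it illustrates the steps and is not a replacement for the notebook's function.

```python
import re

# Hypothetical, tiny stop-word set for illustration only; the notebook uses
# NLTK's full English list via stopwords.words('english').
DEMO_STOP_WORDS = {"a", "an", "the", "is", "for", "with", "in", "of", "and"}

def normalize_without_lemmatizing(text):
    """Return the intermediate result of each step (lemmatization omitted)."""
    lowered = str(text).lower()                       # step 1: lowercase
    letters_only = re.sub(r"[^a-z\s]", " ", lowered)  # step 2: symbols -> spaces
    tokens = [t for t in letters_only.split()         # step 3: drop stop words
              if t not in DEMO_STOP_WORDS]
    return lowered, letters_only, tokens

sample = "Pay: $18-20/hour for a Marketing Coordinator"
lowered, letters_only, tokens = normalize_without_lemmatizing(sample)
print(lowered)
print(tokens)  # ['pay', 'hour', 'marketing', 'coordinator']
```

Note that digits are discarded along with punctuation, so salary figures and years of experience do not survive into the cleaned text.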



### Applying the Cleaning Function and Combining Text

In [None]:
print("\nApplying advanced cleaning and creating the 'combined_text' column...")

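import ast

# skill_name holds list literals stored as strings, so parse them rather than test for list objects.
if 'cleaned_skills' not in df_processed.columns and 'skill_name' in df_processed.columns:
    df_processed['cleaned_skills'] = df_processed['skill_name'].apply(lambda skills: ', '.join(ast.literal_eval(skills)) if isinstance(skills, str) else '')

# fillna('') keeps missing values from becoming the literal token "nan".
title_cleaned = df_processed['title'].fillna('').apply(clean_text_nltk)
skills_cleaned = df_processed['cleaned_skills'].fillna('').apply(clean_text_nltk)
description_cleaned = df_processed['description'].fillna('').apply(clean_text_nltk)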

df_processed['combined_text'] = title_cleaned + ' ' + skills_cleaned + ' ' + description_cleaned
print("'combined_text' column created successfully.")


Applying advanced cleaning and creating the 'combined_text' column...
'combined_text' column created successfully.
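
One detail worth keeping in mind when combining fields: `clean_text_nltk` passes its input through `str()`, so a missing value (NaN) that reaches it turns into the literal token `nan`. The sketch below shows one way to guard against that in plain Python, assuming missing fields should contribute nothing to the combined text; the helper names `to_clean_input` and `combine_fields` are hypothetical.

```python
import math

def to_clean_input(value):
    """Return '' for None/NaN instead of the literal string 'nan'."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return ""
    return str(value)

def combine_fields(*fields):
    """Join the non-empty fields with single spaces."""
    parts = [to_clean_input(field).strip() for field in fields]
    return " ".join(part for part in parts if part)

missing = float("nan")
print(str(missing))                                            # nan
print(combine_fields("service technician", missing, "hvac"))   # service technician hvac
```

Skipping empty parts also avoids the doubled or leading spaces that a plain `a + ' ' + b + ' ' + c` concatenation produces when one field is empty.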


## Verification and Saving the Processed Data

In [None]:
print("\nDisplaying results for verification:")
pd.set_option('display.max_colwidth', None)
df_processed[['title', 'skill_name', 'description', 'combined_text']].head()


Displaying results for verification:


,title,skill_name,description,combined_text
0,Marketing Coordinator,"['Marketing', 'Sales']","Job descriptionA leading real estate firm in New Jersey is seeking an administrative Marketing Coordinator with some experience in graphic design. You will be working closely with our fun, kind, ambitious members of the sales team and our dynamic executive team on a daily basis. This is an opportunity to be part of a fast-growing, highly respected real estate brokerage with a reputation for exceptional marketing and extraordinary culture of cooperation and inclusion.Who you are:You must be a well-organized, creative, proactive, positive, and most importantly, kind-hearted person. Please, be responsible, respectful, and cool-under-pressure. Please, be proficient in Adobe Creative Cloud (Indesign, Illustrator, Photoshop) and Microsoft Office Suite. Above all, have fantastic taste and be a good-hearted, fun-loving person who loves working with people and is eager to learn.Role:Our office is a fast-paced environment. You’ll work directly with a Marketing team and communicate daily with other core staff and our large team of agents. 
This description is a brief overview, but your skills and interests will be considered in what you work on and as the role evolves over time.Agent Assistance- Receive & Organize Marketing Requests from Agents- Track Tasks & Communicate with Marketing team & Agents on Status- Prepare print materials and signs for open houses- Submit Orders to Printers & Communicate & Track DeadlinesGraphic Design & Branding- Managing brand strategy and messaging through website, social media, videos, online advertising, print placement and events- Receive, organize, and prioritize marketing requests from agents- Fulfill agent design requests including postcards, signs, email marketing and property brochures using pre-existing templates and creating custom designs- Maintain brand assets and generic filesEvents & Community- Plan and execute events and promotions- Manage Contacts & Vendors for Event Planning & SponsorshipsOur company is committed to creating a diverse environment and is proud to be an equal opportunity employer. 
All qualified applicants will receive consideration for employment without regard to race, color, religion, gender, gender identity or expression, sexual orientation, national origin, genetics, disability, age, or veteran status.Job Type: Full-time\nPay: $18-20/hour\nExpected hours: 35 – 45 per week\nBenefits:Paid time offSchedule:8 hour shiftMonday to FridayExperience:Marketing: 1 year (Preferred)Graphic design: 2 years (Preferred)Work Location: In person\n",marketing coordinator marketing sale job descriptiona leading real estate firm new jersey seeking administrative marketing coordinator experience graphic design working closely fun kind ambitious member sale team dynamic executive team daily basis opportunity part fast growing highly respected real estate brokerage reputation exceptional marketing extraordinary culture cooperation inclusion must well organized creative proactive positive importantly kind hearted person please responsible respectful cool pressure please proficient adobe creative cloud indesign illustrator photoshop microsoft office suite fantastic taste good hearted fun loving person love working people eager learn role office fast paced environment work directly marketing team communicate daily core staff large team agent description brief overview skill interest considered work role evolves time agent assistance receive organize marketing request agent track task communicate marketing team agent status prepare print material sign open house submit order printer communicate track deadlinesgraphic design branding managing brand strategy messaging website social medium video online advertising print placement event receive organize prioritize marketing request agent fulfill agent design request including postcard sign email marketing property brochure using pre existing template creating custom design maintain brand asset generic filesevents community plan execute event promotion manage contact vendor event planning sponsorshipsour company 
committed creating diverse environment proud equal opportunity employer qualified applicant receive consideration employment without regard race color religion gender gender identity expression sexual orientation national origin genetics disability age veteran status job type full time pay hour expected hour per week benefit paid time offschedule hour shiftmonday fridayexperience marketing year preferred graphic design year preferred work location person
1,Mental Health Therapist/Counselor,['Health Care Provider'],"At Aspen Therapy and Wellness , we are committed to serving clients with best practices to help them with change, improvements and better quality of life. We believe in providing a secure, supportive environment to grow as a clinician and learn how to foster longevity in the career which is part of our mission statement.\nThank you for taking the time to explore a career with us. We are excited to be a new group practice in the community. If you are looking for quality supervision as you work towards licensure and ability to serve populations while accepting a variety of insurance panels, we may be a good fit. Our supervisors are trained in EMDR and utilize a parts work perspective with a trauma lens.\nWe are actively looking to hire a therapist in the area who is passionate about working with adults and committed to growth and excellence in the field. We are located in Old Town Square, Fort Collins.\nWe value and are strengthened by diversity and desire a warm and welcoming place for all people. 
We believe in racial and ethnic equality, gender equity and social inclusion.\nPosition Requirement Possibilities:A graduate level psychological counseling-related degreeMasters of Social Work (MSW/LSW)Licensed Professional Counselor Candidate (LPCC)Clinical Social worker (LCSW)Professional Counselor (LPC)Marriage/Family Therapist (LMFT)Relating to this?Wanting to deliver high quality mental healthcareSeeking quality supervision and growth in a healthy environmentWhat we offer:Flexible work scheduleW2 Employment - commission basedBuilding to full time workJump of 5% in commission as well as monthly bonus/stipend once full timeWeekly supervision providedPaid weekly team meetings $30/hrTwo paid wellness hours/month $30/hrTelemedicine and in-person flexibilitySupportive work environment with direct access to two supervisorsAdministrative supportApproved professional development training providedFully automated EHR and technology supportStrong work/life balanceJob Duties:Conducting intake assessmentsDeveloping and implementing treatment plans for clients based on assessment and coordinating any additional services needed, revising as necessaryConducting individual sessions as appropriate for the treatment plan of the patientApplying psychotherapeutic techniques and interventions in the delivery of services to individuals for the purpose of treating emotional and behavioral disorders that have been diagnosed in assessmentParticipating in team meetings in order to staff new cases. Presenting appropriate patient information to the team. Recommending effective treatment interventions.Building and maintaining an active caseload with assigned clientsCompleting timely progress notes and treatment updates in the EHR. 
Maintaining all clinical documentation in accordance with regulatory and accrediting standardsProviding crisis intervention to patients in acute distress and referring as neededPerforming case management and discharge planning as neededExcellent communication and interpersonal skillsCompassionate and empathetic approach to patient carePlease send resume and cover letter to info@aspentherapyandwellness.com\nAbout Aspen Therapy and Wellness LLCAspen Therapy and Wellness is a mental health services provider focusing on work with adults in an outpatient setting, working with a variety of mental health issues both in-person in Old Town Fort Collins and throughout the state of Colorado via telehealth services.\nPlease note that this job description is not exhaustive and additional duties may be assigned as needed.",mental health therapist counselor health care provider aspen therapy wellness committed serving client best practice help change improvement better quality life believe providing secure supportive environment grow clinician learn foster longevity career part mission statement thank taking time explore career u excited new group practice community looking quality supervision work towards licensure ability serve population accepting variety insurance panel may good fit supervisor trained emdr utilize part work perspective trauma lens actively looking hire therapist area passionate working adult committed growth excellence field located old town square fort collins value strengthened diversity desire warm welcoming place people believe racial ethnic equality gender equity social inclusion position requirement possibility graduate level psychological counseling related degreemasters social work msw lsw licensed professional counselor candidate lpcc clinical social worker lcsw professional counselor lpc marriage family therapist lmft relating wanting deliver high quality mental healthcareseeking quality supervision growth healthy environmentwhat offer flexible work 
schedulew employment commission basedbuilding full time workjump commission well monthly bonus stipend full timeweekly supervision providedpaid weekly team meeting hrtwo paid wellness hour month hrtelemedicine person flexibilitysupportive work environment direct access two supervisorsadministrative supportapproved professional development training providedfully automated ehr technology supportstrong work life balancejob duty conducting intake assessmentsdeveloping implementing treatment plan client based assessment coordinating additional service needed revising necessaryconducting individual session appropriate treatment plan patientapplying psychotherapeutic technique intervention delivery service individual purpose treating emotional behavioral disorder diagnosed assessmentparticipating team meeting order staff new case presenting appropriate patient information team recommending effective treatment intervention building maintaining active caseload assigned clientscompleting timely progress note treatment update ehr maintaining clinical documentation accordance regulatory accrediting standardsproviding crisis intervention patient acute distress referring neededperforming case management discharge planning neededexcellent communication interpersonal skillscompassionate empathetic approach patient careplease send resume cover letter info aspentherapyandwellness com aspen therapy wellness llcaspen therapy wellness mental health service provider focusing work adult outpatient setting working variety mental health issue person old town fort collins throughout state colorado via telehealth service please note job description exhaustive additional duty may assigned needed
2,Assitant Restaurant Manager,"['Management', 'Manufacturing']","The National Exemplar is accepting applications for an Assistant Restaurant Manager.\nWe offer highly competitive wages, healthcare, paid time off, complimentary dining privileges and bonus opportunities. \nWe are a serious, professional, long-standing neighborhood restaurant with over 41 years of service. If you are looking for a long-term fit with a best in class organization then you should apply now. \nPlease send a resumes to pardom@nationalexemplar.com. o",assitant restaurant manager management manufacturing national exemplar accepting application assistant restaurant manager offer highly competitive wage healthcare paid time complimentary dining privilege bonus opportunity serious professional long standing neighborhood restaurant year service looking long term fit best class organization apply please send resume pardom nationalexemplar com
3,Senior Elder Law / Trusts and Estates Associate Attorney,['Other'],"Senior Associate Attorney - Elder Law / Trusts and Estates Our legal team is committed to providing each client with quality counsel, innovative solutions, and personalized service. Founded in 2000, the firm offers the legal expertise of its 115+ attorneys, who have accumulated experience and problem-solving skills over decades of practice.\nWe are a prominent Lake Success Law Firm seeking an associate attorney for its growing Elder Law and Estate Planning practice. The successful candidate will be a self-motivated, detail-oriented team member with strong communication skills and a desire to grow their practice. Experience with Estate Planning, Administration, and Litigation and is preferred.\n Responsibilities will include:\nCounseling clients with regard to estate planning and asset protection;Formulating and overseeing execution of Medicaid and estate plans;Drafting wills, revocable and irrevocable trusts, powers of attorney, health care proxies, and living wills;Estate Administration;Trust Administration;Court Appearances for Estate and Proceedings;Supervising paralegals \nQualifications:Juris Doctor degree (J.D.) 
from an accredited law schoolLicensed to practice law in New York10-15 years of experienceExperience with various advance directives, trusts, and willsStrong analytical and problem-solving skillsAbility to build rapport with clientsExcellent written and verbal communication skills\n Competitive salary commensurate with experienceSalary: $140,000- $175,000Benefits: 401k, Medical, Dental, Life Insurance, PTO, and more\nThis position is based out of Lake Success, NY",senior elder law trust estate associate attorney senior associate attorney elder law trust estate legal team committed providing client quality counsel innovative solution personalized service founded firm offer legal expertise attorney accumulated experience problem solving skill decade practice prominent lake success law firm seeking associate attorney growing elder law estate planning practice successful candidate self motivated detail oriented team member strong communication skill desire grow practice experience estate planning administration litigation preferred responsibility include counseling client regard estate planning asset protection formulating overseeing execution medicaid estate plan drafting will revocable irrevocable trust power attorney health care proxy living will estate administration trust administration court appearance estate proceeding supervising paralegal qualification juris doctor degree j accredited law schoollicensed practice law new york year experienceexperience various advance directive trust willsstrong analytical problem solving skillsability build rapport clientsexcellent written verbal communication skill competitive salary commensurate experiencesalary benefit k medical dental life insurance pto position based lake success ny
4,Service Technician,['Information Technology'],"Looking for HVAC service tech with experience in commerical and industrial equipment. Minimum 5 yrs. on the job with mechanical license. Winger is a full line union mechanical business with Piping, plumbing, sheet metal and service.",service technician information technology looking hvac service tech experience commerical industrial equipment minimum yr job mechanical license winger full line union mechanical business piping plumbing sheet metal service


In [None]:
if 'df_processed' in locals():
    base_path = '/content/drive/My Drive/job-analysis/job-analysis-dataset/data_cleaning/'
    output_path = base_path + 'cleaned_for_classification.csv'

    print(f"Saving the cleaned DataFrame to: {output_path}")
    df_processed.to_csv(output_path, index=False)
    print("\nFile saved successfully!")
else:
    print("Error: The 'df_processed' DataFrame was not found. Please ensure you have run the data cleaning and processing code first.")

Saving the cleaned DataFrame to: /content/drive/My Drive/job-analysis/job-analysis-dataset/data_cleaning/cleaned_for_classification.csv

File saved successfully!
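
The save step assumes the data_cleaning folder already exists on Drive; `DataFrame.to_csv` does not create missing directories. A minimal sketch of a helper that creates the folder first, where the name `prepare_output_path` is hypothetical and a temporary directory stands in for the Drive path:

```python
import tempfile
from pathlib import Path

def prepare_output_path(base_dir, filename):
    """Create base_dir if needed and return the full path to filename inside it."""
    directory = Path(base_dir)
    directory.mkdir(parents=True, exist_ok=True)
    return directory / filename

# Demo with a temporary directory standing in for the Drive folder.
demo_base = Path(tempfile.mkdtemp()) / "job-analysis-dataset" / "data_cleaning"
output_path = prepare_output_path(demo_base, "cleaned_for_classification.csv")
print(output_path.parent.is_dir())  # True
```

In the notebook, the returned path could be passed straight to `df_processed.to_csv(output_path, index=False)`.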
