## Problem Statement:
**"Can we predict whether a job posting is for a remote position based on company characteristics, job details, and location information?"**

## Business Importance:
This prediction model can help:
- **Job seekers** quickly filter remote opportunities
- **Job platforms** improve their remote job classification systems
- **Companies** understand what factors make positions suitable for remote work
- **Recruiters** identify which roles are likely to be remote

In [11]:
# DATA CLEANING

print("3.1 Selecting relevant columns...")

# Keep only columns needed for remote work prediction
columns_to_keep = [
    'company_name',
    'title',
    'location',
    'job_types',
    'tags',
    'is_remote'
]

# Create cleaned dataframe with selected columns
df_clean = df[columns_to_keep].copy()
print(f"Selected {len(columns_to_keep)} columns for analysis")
print(f"Dataset shape: {df_clean.shape}")

# Check Data Types
print("\nChecking data types...")
print(df_clean.dtypes)

# Handle Missing/Empty Information
print("\nHandling empty job information...")

# The 'job_types' and 'tags' columns store lists as strings (e.g. "['a', 'b']").
# Parse them into real Python lists so that empty lists can be detected below.
import ast

def parse_list_string(x):
    # Only parse strings that look like a list; leave other values unchanged
    return ast.literal_eval(x) if isinstance(x, str) and x.startswith('[') else x

df_clean['job_types_list'] = df_clean['job_types'].apply(parse_list_string)
df_clean['tags_list'] = df_clean['tags'].apply(parse_list_string)

# Identify rows with no job types AND no tags
no_info_mask = (df_clean['job_types_list'].apply(lambda x: x == [])) & (df_clean['tags_list'].apply(lambda x: x == []))
rows_to_drop = no_info_mask.sum()

print(f"Found {rows_to_drop} rows with no job types AND no tags")
print("Dropping completely uninformative rows...")

# Drop rows that have neither job types nor tags
df_clean = df_clean[~no_info_mask].copy()
# Note: this shape still includes the two temporary list columns
print(f"New dataset shape: {df_clean.shape}")

# Remove temporary list columns
df_clean.drop(['job_types_list', 'tags_list'], axis=1, inplace=True)

# Final Data Check
print("\nFinal data check...")
print(f"Final dataset shape: {df_clean.shape}")
print("Missing values check:")
print(df_clean.isnull().sum())

# Save cleaned data
df_clean.to_csv('dataProject.csv', index=False)
print("Cleaned data saved to 'dataProject.csv'")

3.1 Selecting relevant columns...
Selected 6 columns for analysis
Dataset shape: (500, 6)

Checking data types...
company_name    object
title           object
location        object
job_types       object
tags            object
is_remote         bool
dtype: object

Handling empty job information...
Found 29 rows with no job types AND no tags
Dropping completely uninformative rows...
New dataset shape: (471, 8)

Final data check...
Final dataset shape: (471, 6)
Missing values check:
company_name    0
title           0
location        0
job_types       0
tags            0
is_remote       0
dtype: int64
Cleaned data saved to 'dataProject.csv'
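
The cleaning cell above calls `ast.literal_eval` directly, which raises an exception if a cell holds a malformed list string. The following is a minimal sketch of a more defensive parser, not the method used in this notebook. The helper names `parse_list_cell` and `has_no_info` and the toy `rows` data are hypothetical, and treating a malformed cell as an empty list is an assumption that may not suit every dataset.

```python
import ast

def parse_list_cell(value):
    """Parse a stringified list such as "['a', 'b']" into a real list.

    Assumption: a malformed list string is treated as an empty list.
    Values that do not look like a list are returned unchanged.
    """
    if isinstance(value, list):
        return value
    if isinstance(value, str) and value.strip().startswith('['):
        try:
            parsed = ast.literal_eval(value.strip())
        except (ValueError, SyntaxError):
            return []
        return parsed if isinstance(parsed, list) else []
    return value

def has_no_info(row):
    # True when both job types and tags are empty after parsing
    return parse_list_cell(row["job_types"]) == [] and parse_list_cell(row["tags"]) == []

# Hypothetical toy rows standing in for the real dataset
rows = [
    {"job_types": "['full-time']", "tags": "['python', 'sql']"},
    {"job_types": "[]", "tags": "[]"},
    {"job_types": "['contract'", "tags": "[]"},  # malformed list string
]

kept = [row for row in rows if not has_no_info(row)]
print(f"Kept {len(kept)} of {len(rows)} rows")
```

With a DataFrame, the same helper could be passed to `.apply` in place of the inline parsing used in the cleaning cell.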
