1. companies.csv

The companies.csv dataset consists of the following columns:

company_id: A unique identifier for each company.
name: The name of the company.
description: A description of the company.
company_size: A numerical value representing the size of the company.
state: The state in which the company is located.
country: The country in which the company is located.
city: The city in which the company is located.
zip_code: The zip code of the company's location.
address: The address of the company.
url: The LinkedIn URL of the company.

Preliminary Observations:
There are placeholder values like "0" and "-" for missing data in the state, zip_code, and address columns. We need to handle these appropriately.
The company_size is a numerical column, but we should check for outliers or incorrect entries.
We need to check for missing values and duplicates.
Next Steps:
Check for missing values.
Check for duplicates.
Handle the placeholder values in state, zip_code, and address.
Analyze the company_size column for any anomalies or outliers.
Check the data types of each column and convert them as needed.
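
A null check will not report the "0" and "-" placeholders, because pandas reads them as ordinary strings. The sketch below counts them directly. It uses a small made-up frame (the `sample` values are illustrative, not taken from companies.csv), so treat it as a pattern rather than a result.

```python
import pandas as pd

# Toy frame standing in for companies.csv (values are illustrative only)
sample = pd.DataFrame({
    "state": ["NY", "0", "Texas", None],
    "zip_code": ["10504", "0", "77389", "-"],
    "address": ["1 River Road", "-", None, "2300 Oracle Way"],
})

placeholders = ["0", "-"]
location_cols = ["state", "zip_code", "address"]

# Count placeholder strings per column; isnull() alone would miss these
placeholder_counts = sample[location_cols].isin(placeholders).sum()
print(placeholder_counts)
```

Running the same `isin(...).sum()` call on companies_df would show how much location data is effectively missing beyond what the null count reports.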

In [18]:
import pandas as pd

# Load the first dataset
companies_df = pd.read_csv('companies.csv')

# Display the first few rows of the DataFrame
companies_df.head()

Unnamed: 0,company_id,name,description,company_size,state,country,city,zip_code,address,url
0,1009,IBM,"At IBM, we do more than work. We create. We cr...",7.0,NY,US,"Armonk, New York",10504,International Business Machines Corp.,https://www.linkedin.com/company/ibm
1,1016,GE HealthCare,Every day millions of people feel the impact o...,7.0,0,US,Chicago,0,-,https://www.linkedin.com/company/gehealthcare
2,1021,GE Power,"GE Power, part of GE Vernova, is a world energ...",7.0,NY,US,Schenectady,12345,1 River Road,https://www.linkedin.com/company/gepower
3,1025,Hewlett Packard Enterprise,Official LinkedIn of Hewlett Packard Enterpris...,7.0,Texas,US,Houston,77389,1701 E Mossy Oaks Rd Spring,https://www.linkedin.com/company/hewlett-packa...
4,1028,Oracle,We’re a cloud technology company that provides...,7.0,Texas,US,Austin,78741,2300 Oracle Way,https://www.linkedin.com/company/oracle


In [19]:
# Check for missing values in the dataset
companies_missing_values = companies_df.isnull().sum()

companies_missing_values


company_id        0
name              0
description      64
company_size    589
state             3
country           0
city              0
zip_code          8
address           5
url               0
dtype: int64

We have missing values in the following columns:

description: 64 missing values
company_size: 589 missing values
state: 3 missing values
zip_code: 8 missing values
address: 5 missing values
Strategies to Handle Missing Values:
description: This is a text column, so we can either leave the gaps as they are or fill them with a placeholder such as "Not Available".
company_size: We can fill the gaps with the median or mean company size, or label them as unknown.
state, zip_code, and address: These are location columns. We could impute them from the other location information available, or label them as unknown.
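
One way to impute a location column from other location information is to borrow the most common known state among rows that share the same city. This is a minimal sketch on made-up data, and it assumes city names identify a state unambiguously, which will not always hold (many US city names repeat across states).

```python
import pandas as pd

# Toy data: one Austin row is missing its state, Chicago has no known state at all
df = pd.DataFrame({
    "city": ["Austin", "Austin", "Houston", "Chicago"],
    "state": ["Texas", None, "Texas", None],
})

# Most frequent known state for each city
known = df.dropna(subset=["state"])
city_to_state = known.groupby("city")["state"].agg(lambda s: s.mode().iloc[0])

# Fill from the lookup first, then fall back to "Unknown"
df["state"] = df["state"].fillna(df["city"].map(city_to_state)).fillna("Unknown")
print(df)
```

Rows whose city never appears with a known state still end up as "Unknown", so this only reduces the number of gaps.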

Check for duplicate rows in the dataset.

In [20]:
# Check for duplicate rows
duplicate_rows = companies_df[companies_df.duplicated()]
num_duplicate_rows = duplicate_rows.shape[0]

num_duplicate_rows, duplicate_rows.head()  # Display the number of duplicate rows and the first few rows of duplicates if any


(0,
 Empty DataFrame
 Columns: [company_id, name, description, company_size, state, country, city, zip_code, address, url]
 Index: [])
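
`duplicated()` with no arguments only flags rows that match in every column. A company could still appear twice under the same company_id with small differences elsewhere. The sketch below shows a key-based check on made-up rows; whether companies.csv contains such cases is not known from the output above.

```python
import pandas as pd

# Toy data: the same company_id appears twice with a different name spelling
sample = pd.DataFrame({
    "company_id": [1009, 1016, 1016],
    "name": ["IBM", "GE HealthCare", "GE Healthcare"],
})

# Exact-row duplicates: none here, because the names differ
exact_dupes = sample.duplicated().sum()

# Key-based duplicates: keep=False returns every row that shares a company_id
key_dupes = sample[sample.duplicated(subset=["company_id"], keep=False)]
print(exact_dupes, len(key_dupes))
```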

We will handle the missing values in the description column by filling them with the placeholder "Not Available". For the numerical company_size column, we will fill missing values with the median. In state, zip_code, and address, we will replace both missing values and the placeholders "0" and "-" with "Unknown".

In [22]:

# Fill missing values in the 'description' column with 'Not Available'
companies_df['description'] = companies_df['description'].fillna('Not Available')

# Fill missing values in the 'company_size' column with the median value
median_size = companies_df['company_size'].median()
companies_df['company_size'] = companies_df['company_size'].fillna(median_size)

# Replace placeholder values and fill missing values in 'state', 'zip_code', and 'address' with 'Unknown'
columns_with_placeholders = ['state', 'zip_code', 'address']
companies_df[columns_with_placeholders] = (
    companies_df[columns_with_placeholders]
    .replace(['0', '-'], 'Unknown')
    .fillna('Unknown')
)

# Check if there are still any missing values
companies_df.isnull().sum()


company_id      0
name            0
description     0
company_size    0
state           0
country         0
city            0
zip_code        0
address         0
url             0
dtype: int64
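
An alternative to replacing placeholders after loading is to tell `read_csv` to treat them as missing from the start, using its `na_values` parameter with a per-column mapping. The sketch below reads from an in-memory string instead of the real file, and the `placeholder_map` name is only illustrative.

```python
import io
import pandas as pd

# Two illustrative rows in the same shape as companies.csv (subset of columns)
csv_text = """company_id,state,zip_code,address
1009,NY,10504,International Business Machines Corp.
1016,0,0,-
"""

# Per-column placeholder strings that should be read as missing
placeholder_map = {
    "state": ["0", "-"],
    "zip_code": ["0", "-"],
    "address": ["0", "-"],
}

sample = pd.read_csv(
    io.StringIO(csv_text),
    na_values=placeholder_map,
    dtype={"zip_code": str},  # keep zip codes as text
)
print(sample.isnull().sum())
```

With this approach the missing-value counts include the placeholders, and a single `fillna('Unknown')` on the location columns covers both cases.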

Check the data types of each column in the companies.csv dataset.

In [23]:
# Check the data types of each column
companies_df.dtypes

company_id        int64
name             object
description      object
company_size    float64
state            object
country          object
city             object
zip_code         object
address          object
url              object
dtype: object

company_id: Integer
name: String (Object)
description: String (Object)
company_size: Float
state: String (Object)
country: String (Object)
city: String (Object)
zip_code: String (Object)
address: String (Object)
url: String (Object)

Observations:
The company_id is an integer, which is appropriate for an identifier.
The company_size is a float, although the values shown so far are whole numbers; an integer type may be more appropriate.
The other columns (name, description, state, country, city, zip_code, address, and url) are objects (strings), which suits textual data.
Actions to Consider:
Convert company_size to an integer type, since a float adds nothing for employee counts or size categories.
Clean and standardize the textual data, for instance by ensuring consistent capitalization.
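
Before casting to an integer, it is worth checking which values the column actually contains, since `astype(int)` truncates any fractional part and a median used to fill gaps can itself be fractional. A minimal sketch on a made-up series (the values are illustrative, not from companies.csv):

```python
import pandas as pd

# Toy company_size values, including a missing entry and a fractional one
sizes = pd.Series([7.0, 2.0, None, 5.0, 2.5], name="company_size")

# Frequency of each value, including missing ones
print(sizes.value_counts(dropna=False).sort_index())

# Values with a fractional part would be silently truncated by astype(int)
non_missing = sizes.dropna()
fractional = non_missing[non_missing % 1 != 0]
print(fractional)
```

The same frequency table also serves as a quick anomaly check: any value outside the expected set of sizes stands out immediately.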

Convert the company_size column to an integer data type.


In [24]:
# Convert the 'company_size' column to integer type
companies_df['company_size'] = companies_df['company_size'].astype(int)

# Check the data types again to confirm the conversion
companies_df.dtypes


company_id       int64
name            object
description     object
company_size     int64
state           object
country         object
city            object
zip_code        object
address         object
url             object
dtype: object

The company_size column has been successfully converted to an integer type.
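
If the missing sizes should stay visibly missing rather than be filled with the median, pandas' nullable integer dtype `"Int64"` is an option. This is a sketch of that alternative on made-up values, not what the notebook above does.

```python
import pandas as pd

# Toy values: whole numbers stored as floats, with one gap
sizes = pd.Series([7.0, None, 2.0], name="company_size")

# "Int64" (capital I) is the nullable integer dtype; the gap becomes <NA>
sizes_int = sizes.astype("Int64")
print(sizes_int)
print(sizes_int.dtype)
```

The trade-off is that downstream code has to cope with `<NA>` values, whereas median filling hides how many sizes were originally unknown.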

Convert the textual data in the name, description, state, country, city, and address columns to lowercase to ensure consistency.

In [25]:
# List of columns to convert to lowercase
text_columns = ['name', 'description', 'state', 'country', 'city', 'address']

# Convert the textual data to lowercase, one column at a time
companies_df[text_columns] = companies_df[text_columns].apply(lambda col: col.str.lower())

# Display the first few rows of the DataFrame to confirm the changes
companies_df.head()

Unnamed: 0,company_id,name,description,company_size,state,country,city,zip_code,address,url
0,1009,ibm,"at ibm, we do more than work. we create. we cr...",7,ny,us,"armonk, new york",10504,international business machines corp.,https://www.linkedin.com/company/ibm
1,1016,ge healthcare,every day millions of people feel the impact o...,7,unknown,us,chicago,Unknown,unknown,https://www.linkedin.com/company/gehealthcare
2,1021,ge power,"ge power, part of ge vernova, is a world energ...",7,ny,us,schenectady,12345,1 river road,https://www.linkedin.com/company/gepower
3,1025,hewlett packard enterprise,official linkedin of hewlett packard enterpris...,7,texas,us,houston,77389,1701 e mossy oaks rd spring,https://www.linkedin.com/company/hewlett-packa...
4,1028,oracle,we’re a cloud technology company that provides...,7,texas,us,austin,78741,2300 oracle way,https://www.linkedin.com/company/oracle
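
Lowercasing fixes capitalization, but the preview still mixes abbreviations ("ny") with full names ("texas") in the state column. One possible follow-up is to map known spellings to a single form. The `state_abbrev` lookup below is hypothetical and covers only two entries; a real one would need every spelling found in the data.

```python
import pandas as pd

# Toy state values after lowercasing, including stray whitespace
states = pd.Series(["ny", "texas", "unknown", " tx "])

# Hypothetical lookup from full names to abbreviations (incomplete on purpose)
state_abbrev = {"new york": "ny", "texas": "tx"}

# Trim whitespace, then replace exact matches found in the lookup
cleaned = states.str.strip()
standardized = cleaned.replace(state_abbrev)
print(standardized.tolist())
```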


2. benefits.csv

The benefits.csv file contains the following columns:

job_id: An identifier for each job.
inferred: A binary indicator (0 or 1). The dataset preview does not make clear what this column represents, so we may need more context to understand and handle it appropriately.
type: The type of benefit offered.
Here are some steps to consider for cleaning this dataset:

Checking for missing values.
Understanding and handling the inferred column, if necessary.
Cleaning or categorizing the type column to ensure consistency in the benefit names/types.


In [5]:
# Load the second dataset
benefits_df = pd.read_csv('benefits.csv')

# Display the first few rows of the dataframe
benefits_df.head()

Unnamed: 0,job_id,inferred,type
0,3690843087,0,Medical insurance
1,3690843087,0,Dental insurance
2,3690843087,0,401(k)
3,3690843087,0,Paid maternity leave
4,3690843087,0,Disability insurance


In [6]:
# Check for missing values in the benefits dataset
benefits_missing_values = benefits_df.isnull().sum()
benefits_missing_values


job_id      0
inferred    0
type        0
dtype: int64
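
Since the meaning of the inferred column is unclear, a quick look at how its values are distributed can help decide what to do with it. The sketch below uses made-up rows, and the counts it prints say nothing about the real benefits.csv.

```python
import pandas as pd

# Toy rows in the same shape as benefits.csv
sample = pd.DataFrame({
    "job_id": [1, 1, 2, 3],
    "inferred": [0, 0, 1, 1],
    "type": ["Medical insurance", "401(k)", "Medical insurance", "Dental insurance"],
})

# How often each indicator value occurs
inferred_counts = sample["inferred"].value_counts()

# Whether the indicator is tied to particular benefit types
by_type = pd.crosstab(sample["type"], sample["inferred"])
print(inferred_counts)
print(by_type)
```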

In [7]:
# Remove the inferred column
benefits_df.drop(columns=['inferred'], inplace=True)

# Display the first few rows of the dataframe to verify the changes
benefits_df.head()

Unnamed: 0,job_id,type
0,3690843087,Medical insurance
1,3690843087,Dental insurance
2,3690843087,401(k)
3,3690843087,Paid maternity leave
4,3690843087,Disability insurance


In [8]:
# Display unique values in the 'type' column
benefits_unique_types = benefits_df['type'].unique()
benefits_unique_types


array(['Medical insurance', 'Dental insurance', '401(k)',
       'Paid maternity leave', 'Disability insurance', 'Vision insurance',
       'Tuition assistance', 'Pension plan', 'Paid paternity leave',
       'Commuter benefits', 'Student loan assistance',
       'Child care support'], dtype=object)
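
The twelve values above look consistent, but a normalization pass is a cheap way to confirm that no variants differ only by case or whitespace, and that no job lists the same benefit twice. A minimal sketch on made-up rows (the `type_clean` column name is an assumption for illustration):

```python
import pandas as pd

# Toy rows: the first two differ only by case and trailing whitespace
sample = pd.DataFrame({
    "job_id": [1, 1, 2],
    "type": ["Medical insurance", "medical insurance ", "401(k)"],
})

# Normalized copy of the benefit name
sample["type_clean"] = sample["type"].str.strip().str.lower()

n_raw = sample["type"].nunique()
n_clean = sample["type_clean"].nunique()

# Repeated (job_id, benefit) pairs after normalization
dupe_pairs = sample.duplicated(subset=["job_id", "type_clean"]).sum()
print(n_raw, n_clean, dupe_pairs)
```

If `n_raw` and `n_clean` match on the real data, the type column needs no further cleaning.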

3. job_industries.csv

The job_industries.csv file contains the following columns:

job_id: An identifier for each job.
industry_id: An identifier for each industry.
The dataset seems straightforward. Here are the cleaning steps we might consider:

Checking for missing values.
Ensuring that the IDs are consistent and meaningful (we might need additional data to map IDs to actual industry names).


In [10]:
# Load the third dataset
job_industries_df = pd.read_csv('job_industries.csv')

# Display the first few rows of the dataframe
job_industries_df.head()


Unnamed: 0,job_id,industry_id
0,3378133231,68
1,3497509795,96
2,3690843087,47
3,3691775263,112
4,3691779379,80


In [11]:
# Check for missing values in the job_industries dataset
job_industries_missing_values = job_industries_df.isnull().sum()
job_industries_missing_values


job_id         0
industry_id    0
dtype: int64

There are no missing values in the job_industries.csv dataset.
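
If a lookup table of industry names becomes available, a left merge with `indicator=True` would show which industry_id values have no match. The `industries` frame and its names below are entirely hypothetical placeholders; no such table is part of the data shown here.

```python
import pandas as pd

# Toy rows in the same shape as job_industries.csv
job_industries = pd.DataFrame({
    "job_id": [10, 11, 12],
    "industry_id": [68, 96, 999],
})

# Hypothetical lookup table; the names are placeholders, not real mappings
industries = pd.DataFrame({
    "industry_id": [68, 96],
    "industry_name": ["industry_a", "industry_b"],
})

# indicator=True adds a "_merge" column recording where each row matched
merged = job_industries.merge(industries, on="industry_id", how="left", indicator=True)
unmapped = merged[merged["_merge"] == "left_only"]
print(unmapped[["job_id", "industry_id"]])
```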

4. job_skills.csv

The job_skills.csv file contains the following columns:

job_id: An identifier for each job.
skill_abr: An abbreviation for the skills required for the job.
Here are some steps to consider for cleaning this dataset:

Checking for missing values.
Understanding and handling the skill_abr column to ensure consistency in the skill abbreviations.

In [12]:
# Load the fourth dataset
job_skills_df = pd.read_csv('job_skills.csv')

# Display the first few rows of the dataframe
job_skills_df.head()


Unnamed: 0,job_id,skill_abr
0,3690843087,ACCT
1,3690843087,FIN
2,3691763971,MGMT
3,3691763971,MNFC
4,3691775263,MGMT


In [13]:
# Check for missing values in the job_skills dataset
job_skills_missing_values = job_skills_df.isnull().sum()
job_skills_missing_values

job_id       0
skill_abr    0
dtype: int64

There are no missing values in the job_skills.csv dataset.

In [15]:
# Display unique values in the 'skill_abr' column
job_skills_unique_abr = job_skills_df['skill_abr'].unique()
job_skills_unique_abr[:20]  # Display the first 20 unique skill abbreviations to avoid a long output

array(['ACCT', 'FIN', 'MGMT', 'MNFC', 'HCPR', 'ENG', 'IT', 'ADM', 'SALE',
       'DSGN', 'ART', 'EDU', 'TRNG', 'BD', 'PRJM', 'CNSL', 'STRA', 'OTHR',
       'RSCH', 'GENB'], dtype=object)
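
The abbreviations shown are all short uppercase codes. Assuming that pattern is meant to hold for the whole column (an inference from the 20 values above, not a documented rule), a regular-expression check can flag entries that deviate. The series below is made up for illustration.

```python
import pandas as pd

# Toy abbreviations, two of which break the uppercase-only pattern
skills = pd.Series(["ACCT", "FIN", "mgmt", " IT", "OTHR"])

# True where the whole string consists of uppercase ASCII letters
is_valid = skills.str.fullmatch(r"[A-Z]+")
invalid = skills[~is_valid]
print(invalid.tolist())
```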