<a href="https://colab.research.google.com/github/CapitalData/OpenSourceDataScienceAICore/blob/main/DataCleaning_With_SQL.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Session 7: Data Cleaning With SQL**

### **Learning Objectives**


By the end of this module, you will be able to:

- Connect Python to a MySQL database

- Write efficient SQL queries for data extraction and aggregation

- Load and clean SQL data using Pandas


**Fetching Data From CSV to MySQL**

**1. Install Required Libraries**

In [1]:
#Installing mysql python connector package
!pip install mysql-connector-python

Collecting mysql-connector-python
  Downloading mysql_connector_python-9.3.0-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (7.2 kB)
Downloading mysql_connector_python-9.3.0-cp311-cp311-manylinux_2_28_x86_64.whl (33.9 MB)
Installing collected packages: mysql-connector-python
Successfully installed mysql-connector-python-9.3.0


In [2]:
# Install MySQL server (remove any existing MySQL packages first)
!sudo apt-get clean
!sudo apt-get purge mysql*
!sudo apt-get update
!sudo apt-get install -f
!sudo apt-get install mysql-server
!sudo apt-get dist-upgrade

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
Note, selecting 'mysql-testsuite' for glob 'mysql*'
Note, selecting 'mysql-server-5.5' for glob 'mysql*'
Note, selecting 'mysql-server-5.6' for glob 'mysql*'
Note, selecting 'mysql-server-5.7' for glob 'mysql*'
Note, selecting 'mysql-server-8.0' for glob 'mysql*'
Note, selecting 'mysql-client-5.5' for glob 'mysql*'
Note, selecting 'mysql-client-5.6' for glob 'mysql*'
Note, selecting 'mysql-client-5.7' for glob 'mysql*'
Note, selecting 'mysql-client-8.0' for glob 'mysql*'
Note, selecting 'mysql-common' for glob 'mysql*'
Note, selecting 'mysqltcl' for glob 'mysql*'
Note, selecting 'mysql-testsuite-5.5' for glob 'mysql*'
Note, selecting 'mysql-testsuite-5.6' for glob 'mysql*'
Note, selecting 'mysql-testsuite-5.7' for glob 'mysql*'
Note, selecting 'mysql-testsuite-8.0' for glob 'mysql*'
Note, selecting 'mysql-client' for glob 'mysql*'
Note, selecting 'mysql-router' for glob 'mysql*'
Note, selec

**2. Start MySQL Server**

In [3]:
#Start mysql server
!service mysql start

 * Starting MySQL database server mysqld
   ...done.


In [4]:
# Switch the MySQL root user to mysql_native_password authentication and set its password to 'root'
!mysql -e "ALTER USER 'root'@'localhost' IDENTIFIED WITH 'mysql_native_password' BY 'root';FLUSH PRIVILEGES;"

**3. Import Required Modules**

In [1]:
import pathlib
import csv
import mysql.connector
import pandas as pd

**4. Create a Connection Between Python and the MySQL Server**

In [2]:
# Create a connection to the MySQL server
conn = mysql.connector.connect(user='root', password='root', host='localhost')

# Create a cursor to interact with the MySQL server
cursor = conn.cursor()
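
The connection above hard-codes the credentials, which is fine for a throwaway Colab instance. As a minimal sketch of an alternative, the helper below reads them from environment variables and falls back to the values used in this notebook. The function name `mysql_config_from_env` and the variable names `MYSQL_USER`, `MYSQL_PASSWORD`, and `MYSQL_HOST` are hypothetical choices for illustration, not something the connector requires.

```python
import os

def mysql_config_from_env(environ=None):
    """Build keyword arguments for mysql.connector.connect() from environment variables.

    The variable names are hypothetical; the defaults match this notebook's setup.
    """
    env = os.environ if environ is None else environ
    return {
        "user": env.get("MYSQL_USER", "root"),
        "password": env.get("MYSQL_PASSWORD", "root"),
        "host": env.get("MYSQL_HOST", "localhost"),
    }

# Demonstrate with an explicit mapping instead of the real environment
config = mysql_config_from_env({"MYSQL_USER": "analyst", "MYSQL_PASSWORD": "s3cret"})
print(config)

# Usage with the connector (not executed here):
# conn = mysql.connector.connect(**mysql_config_from_env())
```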

**5. Create a New Database If It Does Not Exist**

In [7]:
# Create a new database if it does not exist
cursor.execute("CREATE DATABASE IF NOT EXISTS imdb;")

In [8]:
#View databases in mysql server
cursor.execute("SHOW DATABASES;")
for x in cursor:
  print(x)

('imdb',)
('information_schema',)
('mysql',)
('performance_schema',)
('sys',)


- The imdb database has been created.
- Next, create a table to store the contents of the CSV file.
- The table must have the same column names as the CSV.
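
One way to check that requirement before creating the table is to compare the CSV header with the column list you plan to use. The sketch below is a hedged illustration: `header_mismatches` is a hypothetical helper, and the in-memory sample stands in for layoffs.csv with one row copied from this notebook's data.

```python
import csv
import io

# Column names planned for the layoffs table
EXPECTED_COLUMNS = [
    "company", "location", "industry", "total_laid_off", "percentage_laid_off",
    "date", "stage", "country", "funds_raised_millions",
]

def header_mismatches(csv_file, expected):
    """Return (missing, unexpected) column names by comparing the CSV header with `expected`."""
    header = next(csv.reader(csv_file))
    missing = [name for name in expected if name not in header]
    unexpected = [name for name in header if name not in expected]
    return missing, unexpected

# Tiny in-memory stand-in for layoffs.csv
sample = io.StringIO(
    "company,location,industry,total_laid_off,percentage_laid_off,"
    "date,stage,country,funds_raised_millions\n"
    "Atlassian,Sydney,Other,500,0.05,3/6/2023,Post-IPO,Australia,210\n"
)
missing, unexpected = header_mismatches(sample, EXPECTED_COLUMNS)
print(missing, unexpected)
```

Two empty lists mean the header and the planned columns agree.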

**6. Preview the CSV With Pandas Before Loading It Into MySQL**

In [9]:
df=pd.read_csv("/content/drive/MyDrive/SQL_BAsic_to_Advanced/layoffs.csv")

In [10]:
df.head()

Unnamed: 0,company,location,industry,total_laid_off,percentage_laid_off,date,stage,country,funds_raised_millions
0,Atlassian,Sydney,Other,500.0,0.05,3/6/2023,Post-IPO,Australia,210.0
1,SiriusXM,New York City,Media,475.0,0.08,3/6/2023,Post-IPO,United States,525.0
2,Alerzo,Ibadan,Retail,400.0,,3/6/2023,Series B,Nigeria,16.0
3,UpGrad,Mumbai,Education,120.0,,3/6/2023,Unknown,India,631.0
4,Loft,Sao Paulo,Real Estate,340.0,0.15,3/3/2023,Unknown,Brazil,788.0


In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2361 entries, 0 to 2360
Data columns (total 9 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   company                2361 non-null   object 
 1   location               2361 non-null   object 
 2   industry               2357 non-null   object 
 3   total_laid_off         1621 non-null   float64
 4   percentage_laid_off    1576 non-null   float64
 5   date                   2360 non-null   object 
 6   stage                  2355 non-null   object 
 7   country                2361 non-null   object 
 8   funds_raised_millions  2152 non-null   float64
dtypes: float64(3), object(6)
memory usage: 166.1+ KB


**7. Create Table layoffs With Columns company, location, industry, total_laid_off, percentage_laid_off, date, stage, country, funds_raised_millions**

In [12]:
# Create table layoffs with columns company, location, industry, total_laid_off, percentage_laid_off, date, stage, country, funds_raised_millions
cursor.execute("DROP TABLE IF EXISTS imdb.layoffs;")
conn.commit()
cursor.execute("CREATE TABLE IF NOT EXISTS imdb.layoffs (company VARCHAR(30),location VARCHAR(30),industry VARCHAR(30),total_laid_off FLOAT,percentage_laid_off FLOAT,date DATE,stage VARCHAR(30),country VARCHAR(30),funds_raised_millions FLOAT);")
conn.commit()


In [13]:
#show column in table
cursor.execute("SHOW COLUMNS FROM imdb.layoffs;")
for x in cursor:
  print(x)


('company', 'varchar(30)', 'YES', '', None, '')
('location', 'varchar(30)', 'YES', '', None, '')
('industry', 'varchar(30)', 'YES', '', None, '')
('total_laid_off', 'float', 'YES', '', None, '')
('percentage_laid_off', 'float', 'YES', '', None, '')
('date', 'date', 'YES', '', None, '')
('stage', 'varchar(30)', 'YES', '', None, '')
('country', 'varchar(30)', 'YES', '', None, '')
('funds_raised_millions', 'float', 'YES', '', None, '')


In [14]:
#Viewing number of rows in table
cursor.execute("SELECT COUNT(*) FROM imdb.layoffs;")
for x in cursor:
  print(x)

(0,)


**8. Load data from csv file to table**
- Read each CSV row into a dictionary and collect the dictionaries in a list.

In [15]:
# Define the CSV file path
csv_path = pathlib.Path("/content/drive/MyDrive/SQL_BAsic_to_Advanced/layoffs.csv")
# Create an empty list, dict_list, to hold one dictionary per CSV row
dict_list = list()
# Read the CSV file and append each row to dict_list
with csv_path.open(mode="r", newline="") as csv_file:
    # csv.reader yields each CSV line as a list of strings (the header line included)
    csv_reader = csv.reader(csv_file)
    for row in csv_reader:
        dict_list.append({'company':row[0], 'location':row[1], 'industry':row[2], 'total_laid_off':row[3], 'percentage_laid_off':row[4], 'date':row[5], 'stage':row[6], 'country':row[7], 'funds_raised_millions':row[8]})



In [None]:
dict_list[0]

{'company': 'company',
 'location': 'location',
 'industry': 'industry',
 'total_laid_off': 'total_laid_off',
 'percentage_laid_off': 'percentage_laid_off',
 'date': 'date',
 'stage': 'stage',
 'country': 'country',
 'funds_raised_millions': 'funds_raised_millions'}

In [None]:
dict_list[1]

{'company': 'Atlassian',
 'location': 'Sydney',
 'industry': 'Other',
 'total_laid_off': '500',
 'percentage_laid_off': '0.05',
 'date': '3/6/2023',
 'stage': 'Post-IPO',
 'country': 'Australia',
 'funds_raised_millions': '210'}

In [None]:
# #Replace Null values to 0
# for row in dict_list:
#     for key, value in row.items():
#         if value == 'NULL':
#             row[key] = 0

In [None]:
dict_list[0:10]

[{'company': 'company',
  'location': 'location',
  'industry': 'industry',
  'total_laid_off': 'total_laid_off',
  'percentage_laid_off': 'percentage_laid_off',
  'date': 'date',
  'stage': 'stage',
  'country': 'country',
  'funds_raised_millions': 'funds_raised_millions'},
 {'company': 'Atlassian',
  'location': 'Sydney',
  'industry': 'Other',
  'total_laid_off': '500',
  'percentage_laid_off': '0.05',
  'date': '3/6/2023',
  'stage': 'Post-IPO',
  'country': 'Australia',
  'funds_raised_millions': '210'},
 {'company': 'SiriusXM',
  'location': 'New York City',
  'industry': 'Media',
  'total_laid_off': '475',
  'percentage_laid_off': '0.08',
  'date': '3/6/2023',
  'stage': 'Post-IPO',
  'country': 'United States',
  'funds_raised_millions': '525'},
 {'company': 'Alerzo',
  'location': 'Ibadan',
  'industry': 'Retail',
  'total_laid_off': '400',
  'percentage_laid_off': 'NULL',
  'date': '3/6/2023',
  'stage': 'Series B',
  'country': 'Nigeria',
  'funds_raised_millions': '16'},
 

In [16]:
# Drop the first item from the list: it is the CSV header row (company: company, ...)
dict_list.pop(0)

{'company': 'company',
 'location': 'location',
 'industry': 'industry',
 'total_laid_off': 'total_laid_off',
 'percentage_laid_off': 'percentage_laid_off',
 'date': 'date',
 'stage': 'stage',
 'country': 'country',
 'funds_raised_millions': 'funds_raised_millions'}
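
Removing the header by hand works. As an alternative sketch, `csv.DictReader` from the standard library consumes the header line itself and uses it for the dictionary keys, so no header entry ends up in the list. The in-memory sample below stands in for layoffs.csv and uses two rows from this notebook's data.

```python
import csv
import io

# Tiny in-memory stand-in for layoffs.csv
sample = io.StringIO(
    "company,location,industry,total_laid_off,percentage_laid_off,"
    "date,stage,country,funds_raised_millions\n"
    "Atlassian,Sydney,Other,500,0.05,3/6/2023,Post-IPO,Australia,210\n"
    "Alerzo,Ibadan,Retail,400,NULL,3/6/2023,Series B,Nigeria,16\n"
)

# DictReader reads the header line and uses it for the keys of every row
records = list(csv.DictReader(sample))
print(len(records))
print(records[0]["company"], records[1]["percentage_laid_off"])
```

Values are still strings at this point, including the literal text NULL, so numeric conversion remains a separate step.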

In [17]:
# Now view the first five items in the list
dict_list[0:5]

[{'company': 'Atlassian',
  'location': 'Sydney',
  'industry': 'Other',
  'total_laid_off': '500',
  'percentage_laid_off': '0.05',
  'date': '3/6/2023',
  'stage': 'Post-IPO',
  'country': 'Australia',
  'funds_raised_millions': '210'},
 {'company': 'SiriusXM',
  'location': 'New York City',
  'industry': 'Media',
  'total_laid_off': '475',
  'percentage_laid_off': '0.08',
  'date': '3/6/2023',
  'stage': 'Post-IPO',
  'country': 'United States',
  'funds_raised_millions': '525'},
 {'company': 'Alerzo',
  'location': 'Ibadan',
  'industry': 'Retail',
  'total_laid_off': '400',
  'percentage_laid_off': 'NULL',
  'date': '3/6/2023',
  'stage': 'Series B',
  'country': 'Nigeria',
  'funds_raised_millions': '16'},
 {'company': 'UpGrad',
  'location': 'Mumbai',
  'industry': 'Education',
  'total_laid_off': '120',
  'percentage_laid_off': 'NULL',
  'date': '3/6/2023',
  'stage': 'Unknown',
  'country': 'India',
  'funds_raised_millions': '631'},
 {'company': 'Loft',
  'location': 'Sao Pau

**9. Adding data to database**

In [18]:
# Store date as text for now: the CSV dates are in M/D/YYYY form, not the YYYY-MM-DD form a DATE column expects
cursor.execute("ALTER TABLE imdb.layoffs MODIFY date VARCHAR(30)")
conn.commit()
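
Storing the date as text defers the conversion. If you would rather keep a DATE column, the strings can be converted to ISO format before inserting. The sketch below is one hedged way to do that with the standard library; `to_iso_date` is a hypothetical helper, and it assumes every date in the file is month-first, as values such as 12/16/2022 in this dataset suggest.

```python
from datetime import datetime

def to_iso_date(text):
    """Convert 'M/D/YYYY' to 'YYYY-MM-DD'; return None for empty, 'NULL', or unparseable values."""
    if text is None or text.strip() in ("", "NULL"):
        return None
    try:
        return datetime.strptime(text.strip(), "%m/%d/%Y").date().isoformat()
    except ValueError:
        # Leave unparseable values as missing rather than guessing a format
        return None

print(to_iso_date("3/6/2023"))
print(to_iso_date("NULL"))
```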

In [None]:
# #Adding data to database
# for row in dict_list:
#   query = "INSERT INTO imdb.layoffs (company,location,industry,total_laid_off,percentage_laid_off,date,stage,country,funds_raised_millions) VALUES (%s,%s,%s,%s,%s,%s,%s,%s,%s)"
#   value=(row['company'],row['location'],row['industry'],row['total_laid_off'],row['percentage_laid_off'],row['date'],row['stage'],row['country'],row['funds_raised_millions'])
#   cursor.execute(query,value)
#   conn.commit()


DatabaseError: 1265 (01000): Data truncated for column 'percentage_laid_off' at row 1

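- The error is raised for percentage_laid_off, which is defined as FLOAT. To resolve it, the numeric columns are changed to DECIMAL(10,2) below, and the insert loop converts values that cannot be parsed as numbers (such as the literal string 'NULL' in the CSV) to None.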
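
The sample rows printed earlier contain the literal string 'NULL' where a value is missing, and a numeric column cannot store that text. A minimal sketch of a helper that maps such values to None before inserting is shown below. The name `to_decimal` is hypothetical, and treating every non-numeric string as missing is an assumption you may want to tighten.

```python
from decimal import Decimal, InvalidOperation

def to_decimal(text):
    """Return a Decimal for numeric text, or None for empty, 'NULL', or non-numeric values."""
    if text is None:
        return None
    cleaned = text.strip()
    if cleaned == "" or cleaned.upper() == "NULL":
        return None
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None

# One row from the CSV, restricted to its numeric fields
row = {"total_laid_off": "400", "percentage_laid_off": "NULL", "funds_raised_millions": "16"}
cleaned_row = {key: to_decimal(value) for key, value in row.items()}
print(cleaned_row)
```

Passing None as a query parameter stores SQL NULL, which is what the insert loop later in this notebook relies on.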

In [19]:
# Change the data type of total_laid_off, funds_raised_millions, and percentage_laid_off to DECIMAL(10,2)
cursor.execute("ALTER TABLE imdb.layoffs MODIFY funds_raised_millions DECIMAL(10,2)")
conn.commit()
cursor.execute("ALTER TABLE imdb.layoffs MODIFY total_laid_off DECIMAL(10,2)")
conn.commit()
cursor.execute("ALTER TABLE imdb.layoffs MODIFY percentage_laid_off DECIMAL(10,2)")
conn.commit()

In [20]:
#Alter company,industry,country varchar size to 255
cursor.execute("ALTER TABLE imdb.layoffs MODIFY company VARCHAR(255)")
conn.commit()
cursor.execute("ALTER TABLE imdb.layoffs MODIFY industry VARCHAR(255)")
conn.commit()
cursor.execute("ALTER TABLE imdb.layoffs MODIFY country VARCHAR(255)")
conn.commit()


In [21]:
#Adding data to database
for row in dict_list:
    # Convert 'total_laid_off' and 'funds_raised_millions' to numeric if possible
    try:
        row['total_laid_off'] = float(row['total_laid_off']) if row['total_laid_off'] else None  # Handle empty strings or non-numeric values
    except ValueError:
        row['total_laid_off'] = None  # If conversion fails, set to None
        print(f"Warning: Could not convert total_laid_off value to float for company: {row['company']}")

    try:
        row['funds_raised_millions'] = float(row['funds_raised_millions']) if row['funds_raised_millions'] else None
    except ValueError:
        row['funds_raised_millions'] = None
        print(f"Warning: Could not convert funds_raised_millions value to float for company: {row['company']}")
    # Convert 'percentage_laid_off' to float
    try:
        row['percentage_laid_off'] = float(row['percentage_laid_off']) if row['percentage_laid_off'] else None
    except ValueError:
        row['percentage_laid_off'] = None
        print(f"Warning: Could not convert percentage_laid_off value to float for company: {row['company']}")

    # # Handle potential date format issues
    # try:
    #     row['date'] = pd.to_datetime(row['date']).strftime('%Y-%m-%d')  # Adjust date format if needed
    # except ValueError:
    #     row['date'] = None
    #     print(f"Warning: Could not convert date value for company: {row['company']}")

    query = "INSERT INTO imdb.layoffs (company,location,industry,total_laid_off,percentage_laid_off,date,stage,country,funds_raised_millions) VALUES (%s,%s,%s,%s,%s,%s,%s,%s,%s)"
    value=(row['company'],row['location'],row['industry'],row['total_laid_off'],row['percentage_laid_off'],row['date'],row['stage'],row['country'],row['funds_raised_millions'])
    cursor.execute(query,value)
    conn.commit()
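
The loop above sends one INSERT and one commit per row. A batched variant can be sketched with `executemany()` and a single commit. To keep the example runnable without a MySQL server, it uses an in-memory SQLite database as a stand-in; the table is a cut-down version of layoffs, and `demo_conn` and `demo_cur` are illustrative names chosen so the notebook's real `conn` and `cursor` are not overwritten.

```python
import sqlite3

rows = [
    ("Atlassian", "Sydney", 500.0),
    ("SiriusXM", "New York City", 475.0),
    ("Alerzo", "Ibadan", None),  # None is stored as SQL NULL
]

demo_conn = sqlite3.connect(":memory:")  # stand-in for the MySQL connection
demo_cur = demo_conn.cursor()
demo_cur.execute("CREATE TABLE layoffs (company TEXT, location TEXT, total_laid_off REAL)")

# One executemany() call and one commit instead of a commit per row.
# sqlite3 uses '?' placeholders; mysql-connector-python uses '%s'.
demo_cur.executemany(
    "INSERT INTO layoffs (company, location, total_laid_off) VALUES (?, ?, ?)", rows
)
demo_conn.commit()

# COUNT(*) counts all rows; COUNT(column) skips NULLs
demo_cur.execute("SELECT COUNT(*), COUNT(total_laid_off) FROM layoffs")
result = demo_cur.fetchone()
print(result)
```

With the MySQL connector the same pattern would use the `%s` placeholder query from the loop above and a list of value tuples.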



In [None]:
# View data of table layoffs
cursor.execute("SELECT * FROM imdb.layoffs;")
print(f"{'Company':<20}{'Location':<20}{'Industry':<20}{'Total_Laid_off':<20}{'percentage_Laid_off':<20}{'Date':<20}{'Stage':<20}{'Country':<20}{'Funds_raised_millions':<20}")
print('-' * 200)
for x in cursor.fetchall():
    # Convert None values to empty strings for formatting
    formatted_values = [str(val) if val is not None else '' for val in x]
    print(f"{formatted_values[0]:<20}{formatted_values[1]:<20}{formatted_values[2]:<20}{formatted_values[3]:<20}{formatted_values[4]:<20}{formatted_values[5]:<20}{formatted_values[6]:<20}{formatted_values[7]:<20}{formatted_values[8]:<20}")

Company             Location            Industry            Total_Laid_off      percentage_Laid_off Date                Stage               Country             Funds_raised_millions
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Atlassian           Sydney              Other               500.00              0.05                3/6/2023            Post-IPO            Australia           210.00              
SiriusXM            New York City       Media               475.00              0.08                3/6/2023            Post-IPO            United States       525.00              
Alerzo              Ibadan              Retail              400.00                                  3/6/2023            Series B            Nigeria             16.00               
UpGrad              Mumbai              Education           120.00        

In [22]:
# View data of table layoffs
cursor.execute("SELECT * FROM imdb.layoffs LIMIT 5;")
print(f"{'Company':<20}{'Location':<20}{'Industry':<20}{'Total_Laid_off':<20}{'Percentage_Laid_off':<20}{'Date':<20}{'Stage':<20}{'Country':<20}{'Funds_raised_millions':<20}")
print('-' * 200)
for x in cursor.fetchall():
    # Convert None values to empty strings for formatting
    formatted_values = [str(val) if val is not None else '' for val in x]
    print(f"{formatted_values[0]:<20}{formatted_values[1]:<20}{formatted_values[2]:<20}{formatted_values[3]:<20}{formatted_values[4]:<20}{formatted_values[5]:<20}{formatted_values[6]:<20}{formatted_values[7]:<20}{formatted_values[8]:<20}")

Company             Location            Industry            Total_Laid_off      Percentage_Laid_off Date                Stage               Country             Funds_raised_millions
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Atlassian           Sydney              Other               500.00              0.05                3/6/2023            Post-IPO            Australia           210.00              
SiriusXM            New York City       Media               475.00              0.08                3/6/2023            Post-IPO            United States       525.00              
Alerzo              Ibadan              Retail              400.00                                  3/6/2023            Series B            Nigeria             16.00               
UpGrad              Mumbai              Education           120.00        

In [23]:
# Add an auto-increment id column: without a primary key there is no way to tell identical rows apart, so a unique id is needed to delete specific duplicates
query='''
 ALTER TABLE imdb.layoffs ADD COLUMN id INT AUTO_INCREMENT PRIMARY KEY;
 '''
cursor.execute(query)
conn.commit()


In [24]:
# View data of table layoffs
cursor.execute("SELECT * FROM imdb.layoffs LIMIT 5;")
print(f"{'Company':<20}{'Location':<20}{'Industry':<20}{'Total_Laid_off':<20}{'percentage_laid_off':<20}{'Date':<20}{'Stage':<20}{'Country':<20}{'Funds_raised_millions':<20}{'id':<20}")
print('-' * 200)
for x in cursor.fetchall():
    # Convert None values to empty strings for formatting
    formatted_values = [str(val) if val is not None else '' for val in x]
    print(f"{formatted_values[0]:<20}{formatted_values[1]:<20}{formatted_values[2]:<20}{formatted_values[3]:<20}{formatted_values[4]:<20}{formatted_values[5]:<20}{formatted_values[6]:<20}{formatted_values[7]:<20}{formatted_values[8]:<20}{formatted_values[9]:<20}")

Company             Location            Industry            Total_Laid_off      percentage_laid_off Date                Stage               Country             Funds_raised_millionsid                  
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Atlassian           Sydney              Other               500.00              0.05                3/6/2023            Post-IPO            Australia           210.00              1                   
DocuSign            SF Bay Area         Sales               680.00              0.10                2/16/2023           Post-IPO            United States       536.00              2                   
Pico Interactive    SF Bay Area         Other               400.00              0.20                2/16/2023           Acquired            United States       62.00               3              

#### **10. Data Overview**

**1. Checking Table Columns and data type**

In [None]:
#Checking  columns and datatypes in table layoffs
cursor.execute("SHOW COLUMNS FROM imdb.layoffs;")
for x in cursor:
  print(x)

('company', 'varchar(255)', 'YES', '', None, '')
('location', 'varchar(30)', 'YES', '', None, '')
('industry', 'varchar(255)', 'YES', '', None, '')
('total_laid_off', 'decimal(10,2)', 'YES', '', None, '')
('percentage_laid_off', 'decimal(10,2)', 'YES', '', None, '')
('date', 'varchar(30)', 'YES', '', None, '')
('stage', 'varchar(30)', 'YES', '', None, '')
('country', 'varchar(255)', 'YES', '', None, '')
('funds_raised_millions', 'decimal(10,2)', 'YES', '', None, '')
('id', 'int', 'NO', 'PRI', None, 'auto_increment')


**2. Checking Number of Rows in Table.**

In [None]:
#Checking count of number of rows in data
cursor.execute("SELECT COUNT(*) FROM imdb.layoffs;")
for x in cursor:
  print(x)

print("Total  Number of Rows in table layoffs are ",x[0])

(2361,)
Total  Number of Rows in table layoffs are  2361


**3. Checking Number of columns in Table**

In [None]:
# Number of columns in tables
cursor.execute("SELECT COUNT(*) FROM information_schema.columns WHERE table_schema = 'imdb' AND table_name = 'layoffs';")
for x in cursor:
  print(x)

print("Total  Number of Columns in table layoffs are ",x[0])

(10,)
Total  Number of Columns in table layoffs are  10


### **Removing Duplicates**

**4. Checking Duplicates in Table**

 1. Find Duplicates Based on One Column
 2. Find Duplicates Based on Multiple Columns

In [None]:
# Checking duplicates in table
cursor.execute("SELECT company, COUNT(*) FROM imdb.layoffs GROUP BY company HAVING COUNT(*) > 1;")
for x in cursor:
  print(x)

('Alerzo', 2)
('Loft', 6)
('Airbnb', 2)
('Indigo', 2)
('MasterClass', 2)
('Sonder', 3)
('Eventbrite', 2)
('Cerebral', 3)
('Amount', 2)
('Outreach', 2)
('Twitter', 4)
('Velodyne Lidar', 2)
('OneFootball', 2)
('Dapper Labs', 2)
('TaskUs', 2)
('Arch Oncology', 2)
('Immutable', 2)
('Ethos Life', 3)
('Bolt', 3)
('PeerStreet', 3)
('Tencent', 2)
('Chipper Cash', 2)
('DocuSign', 2)
('The RealReal', 2)
('Convoy', 4)
('Wix', 2)
('Neon', 2)
('Jellysmack', 2)
('Sprinklr', 3)
('Divvy Homes', 3)
('Momentive', 2)
('Twilio', 2)
('Electric', 2)
('iRobot', 2)
('Foodpanda', 3)
('Getir', 3)
('LinkedIn', 2)
('TripleLift', 2)
('Yahoo', 2)
('Deliveroo', 2)
('Veriff', 2)
('GoDaddy', 2)
('Affirm', 2)
('Equitybee', 2)
('Koho', 2)
('Medly', 3)
('Nearmap', 2)
('Salesloft', 2)
('Loggi', 2)
('Clari', 2)
('C6 Bank', 2)
('Lightico', 2)
('FarEye', 2)
("Byju's", 2)
('Okta', 2)
('Desktop Metal', 3)
('Getaround', 2)
('Talkdesk', 2)
('Splunk', 2)
('Pinterest', 2)
('Wheel', 2)
('Appgate', 2)
('TheSkimm', 2)
('Ada', 2)
('Bu

- Inspect the rows for a company before deleting anything, because a repeated company name does not necessarily mean a duplicate record.

In [None]:
query="""
 SELECT *
 FROM imdb.layoffs
 WHERE company IN ('Alerzo','Loft');
 """
cursor.execute(query)
for x in cursor:
  print(x)

('Alerzo', 'Ibadan', 'Retail', Decimal('400.00'), None, '3/6/2023', 'Series B', 'Nigeria', Decimal('16.00'), 3)
('Loft', 'Sao Paulo', 'Real Estate', Decimal('340.00'), Decimal('0.15'), '3/3/2023', 'Unknown', 'Brazil', Decimal('788.00'), 5)
('Loft', 'Sao Paulo', 'Real Estate', Decimal('312.00'), Decimal('0.12'), '12/7/2022', 'Unknown', 'Brazil', Decimal('788.00'), 546)
('Alerzo', 'Ibadan', 'Retail', None, None, '9/2/2022', 'Series B', 'Nigeria', Decimal('16.00'), 1017)
('Loft', 'Sao Paulo', 'Real Estate', Decimal('384.00'), Decimal('0.12'), '7/5/2022', 'Unknown', 'Brazil', Decimal('788.00'), 1318)
('Loft', 'Sao Paulo', 'Real Estate', Decimal('159.00'), None, '4/19/2022', 'Unknown', 'Brazil', Decimal('788.00'), 1621)
('Loft', 'Sao Paulo', 'Real Estate', Decimal('47.00'), Decimal('0.10'), '4/17/2020', 'Series C', 'Brazil', Decimal('263.00'), 2036)
('Loft', 'Sao Paulo', 'Real Estate', Decimal('47.00'), None, '3/27/2020', 'Series C', 'Brazil', Decimal('263.00'), 2277)


In [None]:
query="""
 SELECT *
 FROM imdb.layoffs
 WHERE company = 'The Predictive Index';
 """
cursor.execute(query)
for x in cursor:
  print(x)

('The Predictive Index', 'Boston', 'HR', Decimal('40.00'), None, '8/2/2022', 'Acquired', 'United States', Decimal('71.00'), 1169)
('The Predictive Index', 'Boston', 'HR', Decimal('59.00'), Decimal('0.25'), '4/2/2020', 'Acquired', 'United States', Decimal('65.00'), 2190)
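
The rows above share a company name but describe different layoff events, so grouping on company alone overstates the duplicates. The second item in the list at the top of this section, duplicates based on multiple columns, can be sketched by adding more columns to GROUP BY. The example uses an in-memory SQLite database as a stand-in for MySQL, a cut-down table, and four sample rows based on this dataset; `demo_db` is an illustrative name.

```python
import sqlite3

demo_db = sqlite3.connect(":memory:")  # stand-in for the MySQL connection
demo_db.execute("CREATE TABLE layoffs (company TEXT, total_laid_off REAL, date TEXT)")
demo_db.executemany(
    "INSERT INTO layoffs VALUES (?, ?, ?)",
    [
        ("The Predictive Index", 40, "8/2/2022"),  # two separate layoff events
        ("The Predictive Index", 59, "4/2/2020"),
        ("Cazoo", 750, "6/7/2022"),                # same event recorded twice
        ("Cazoo", 750, "6/7/2022"),
    ],
)

# Grouping by one column flags both companies
by_company = demo_db.execute(
    "SELECT company, COUNT(*) FROM layoffs "
    "GROUP BY company HAVING COUNT(*) > 1 ORDER BY company"
).fetchall()

# Grouping by several columns flags only rows that agree on all of them
by_event = demo_db.execute(
    "SELECT company, total_laid_off, date, COUNT(*) FROM layoffs "
    "GROUP BY company, total_laid_off, date HAVING COUNT(*) > 1"
).fetchall()

print(by_company)
print(by_event)
```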


In [None]:
# Check duplicates by company,industry,total_laid_off and date
query="""
SELECT id, company, industry, total_laid_off,date,
		ROW_NUMBER() OVER (
			PARTITION BY company, industry, total_laid_off,date ORDER BY id) AS row_num
	FROM
	imdb.layoffs;
    """
cursor.execute(query)
for x in cursor:
  print(x)


(487, ' E Inc.', 'Transportation', None, '12/16/2022', 1)
(1219, ' Included Health', 'Healthcare', None, '7/25/2022', 1)
(678, '&Open', 'Marketing', Decimal('9.00'), '11/17/2022', 1)
(236, '#Paid', 'Marketing', Decimal('19.00'), '1/27/2023', 1)
(1284, '100 Thieves', 'Consumer', Decimal('12.00'), '7/13/2022', 1)
(417, '100 Thieves', 'Retail', None, '1/10/2023', 1)
(1143, '10X Genomics', 'Healthcare', Decimal('100.00'), '8/4/2022', 1)
(2187, '1stdibs', 'Retail', Decimal('70.00'), '4/2/2020', 1)
(1522, '2TM', 'Crypto', Decimal('90.00'), '6/1/2022', 1)
(1019, '2TM', 'Crypto', Decimal('100.00'), '9/1/2022', 1)
(1200, '2U', 'Education', None, '7/28/2022', 1)
(1040, '54gene', 'Healthcare', Decimal('95.00'), '8/29/2022', 1)
(1503, '5B Solar', 'Energy', None, '6/3/2022', 1)
(891, '6sense', 'Sales', Decimal('150.00'), '10/12/2022', 1)
(329, '80 Acres Farms', 'Food', None, '1/18/2023', 1)
(319, '8x8', 'Support', Decimal('155.00'), '1/18/2023', 1)
(922, '8x8', 'Support', Decimal('200.00'), '10/4/2

In [None]:
#Check all duplicates with row_number>1
query="""
SELECT *
FROM (
	SELECT company, industry, total_laid_off,date,
		ROW_NUMBER() OVER (
			PARTITION BY company, industry, total_laid_off,date order by id
			) AS row_num
	FROM
		imdb.layoffs
) duplicates
WHERE
	row_num > 1;
  """

cursor.execute(query)
for x in cursor:
  print(x)

('Casper', 'Retail', None, '9/14/2021', 2)
('Cazoo', 'Transportation', Decimal('750.00'), '6/7/2022', 2)
('Hibob', 'HR', Decimal('70.00'), '3/30/2020', 2)
('Oda', 'Food', Decimal('70.00'), '11/1/2022', 2)
('Oda', 'Food', Decimal('70.00'), '11/1/2022', 3)
('Terminus', 'Marketing', None, '5/27/2022', 2)
('Wildlife Studios', 'Consumer', Decimal('300.00'), '11/28/2022', 2)
('Yahoo', 'Consumer', Decimal('1600.00'), '2/9/2023', 2)


In [None]:
# let's just look at oda to confirm
query="""
 SELECT *
 FROM imdb.layoffs
 WHERE company = 'Oda';
 """
cursor.execute(query)
for x in cursor:
  print(x)

('Oda', 'Oslo', 'Food', Decimal('70.00'), Decimal('0.18'), '11/1/2022', 'Unknown', 'Sweden', Decimal('377.00'))
('Oda', 'Oslo', 'Food', Decimal('70.00'), Decimal('0.18'), '11/1/2022', 'Unknown', 'Norway', Decimal('477.00'))
('Oda', 'Oslo', 'Food', Decimal('70.00'), Decimal('0.06'), '11/1/2022', 'Unknown', 'Norway', Decimal('479.00'))


- The three Oda rows differ in percentage_laid_off, country, or funds_raised_millions, so they look like legitimate entries and should not be deleted. Matching on only four columns is too loose.
- To find true duplicates, compare the values of every column, not just a subset.
- Cross-check the flagged rows manually before deleting them.
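
The difference between matching on a few columns and matching on every column can be sketched in plain Python with `collections.Counter`. The sample rows are taken from this dataset (three Oda rows and a repeated Hibob row); everything is kept as strings for simplicity, which is an assumption of the sketch rather than how the table stores them.

```python
from collections import Counter

rows = [
    ("Oda", "Oslo", "Food", "70", "0.18", "11/1/2022", "Unknown", "Sweden", "377"),
    ("Oda", "Oslo", "Food", "70", "0.18", "11/1/2022", "Unknown", "Norway", "477"),
    ("Oda", "Oslo", "Food", "70", "0.06", "11/1/2022", "Unknown", "Norway", "479"),
    ("Hibob", "Tel Aviv", "HR", "70", "0.3", "3/30/2020", "Series A", "Israel", "45"),
    ("Hibob", "Tel Aviv", "HR", "70", "0.3", "3/30/2020", "Series A", "Israel", "45"),
]

# Count identical tuples: a row is a true duplicate only if every field matches
full_counts = Counter(rows)
full_duplicates = {row: n for row, n in full_counts.items() if n > 1}

# Compare with a narrower key: (company, industry, total_laid_off, date)
narrow_counts = Counter((r[0], r[2], r[3], r[5]) for r in rows)
narrow_duplicates = {key: n for key, n in narrow_counts.items() if n > 1}

print(len(full_duplicates), len(narrow_duplicates))
```

The narrow key flags both companies, while the full-row comparison flags only the repeated Hibob row.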

In [None]:
# Get duplicates considering all columns values
query="""
SELECT *
FROM (
	SELECT company, location, industry, total_laid_off,percentage_laid_off,date, stage, country, funds_raised_millions,
		ROW_NUMBER() OVER (
			PARTITION BY company, location, industry, total_laid_off,percentage_laid_off,`date`, stage, country, funds_raised_millions
			ORDER BY id) AS row_num
	FROM
		imdb.layoffs
) duplicates
WHERE
	row_num > 1;
  """

cursor.execute(query)
for x in cursor:
  print(x)


('Casper', 'New York City', 'Retail', None, None, '9/14/2021', 'Post-IPO', 'United States', Decimal('339.00'), 2)
('Cazoo', 'London', 'Transportation', Decimal('750.00'), Decimal('0.15'), '6/7/2022', 'Post-IPO', 'United Kingdom', Decimal('2000.00'), 2)
('Hibob', 'Tel Aviv', 'HR', Decimal('70.00'), Decimal('0.30'), '3/30/2020', 'Series A', 'Israel', Decimal('45.00'), 2)
('Wildlife Studios', 'Sao Paulo', 'Consumer', Decimal('300.00'), Decimal('0.20'), '11/28/2022', 'Unknown', 'Brazil', Decimal('260.00'), 2)
('Yahoo', 'SF Bay Area', 'Consumer', Decimal('1600.00'), Decimal('0.20'), '2/9/2023', 'Acquired', 'United States', Decimal('6.00'), 2)


**Remove Duplicates**

Now write a CTE to remove the duplicate rows.

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2361 entries, 0 to 2360
Data columns (total 9 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   company                2361 non-null   object 
 1   location               2361 non-null   object 
 2   industry               2357 non-null   object 
 3   total_laid_off         1621 non-null   float64
 4   percentage_laid_off    1576 non-null   float64
 5   date                   2360 non-null   object 
 6   stage                  2355 non-null   object 
 7   country                2361 non-null   object 
 8   funds_raised_millions  2152 non-null   float64
dtypes: float64(3), object(6)
memory usage: 166.1+ KB


In [None]:
#find duplicates in df
df.duplicated().sum()

np.int64(5)

In [None]:
#show duplicate values
df[df.duplicated()]

Unnamed: 0,company,location,industry,total_laid_off,percentage_laid_off,date,stage,country,funds_raised_millions
1492,Cazoo,London,Transportation,750.0,0.15,6/7/2022,Post-IPO,United Kingdom,2000.0
2357,Yahoo,SF Bay Area,Consumer,1600.0,0.2,2/9/2023,Acquired,United States,6.0
2358,Hibob,Tel Aviv,HR,70.0,0.3,3/30/2020,Series A,Israel,45.0
2359,Casper,New York City,Retail,,,9/14/2021,Post-IPO,United States,339.0
2360,Wildlife Studios,Sao Paulo,Consumer,300.0,0.2,11/28/2022,Unknown,Brazil,260.0


In [None]:
df['company'].value_counts()

Unnamed: 0_level_0,count
company,Unnamed: 1_level_1
Loft,6
OYO,5
Uber,5
Swiggy,5
Truepill,4
...,...
ZenBusiness,1
Wavely,1
SendCloud,1
Reforge,1


In [None]:
# Show the rows where company == 'Oda'
df[df['company']=='Oda']

Unnamed: 0,company,location,industry,total_laid_off,percentage_laid_off,date,stage,country,funds_raised_millions
816,Oda,Oslo,Food,70.0,0.18,11/1/2022,Unknown,Sweden,377.0
817,Oda,Oslo,Food,70.0,0.18,11/1/2022,Unknown,Norway,477.0
818,Oda,Oslo,Food,70.0,0.06,11/1/2022,Unknown,Norway,479.0


In [None]:
#unique values in df
df.nunique()

Unnamed: 0,0
company,1893
location,191
industry,32
total_laid_off,285
percentage_laid_off,73
date,483
stage,16
country,60
funds_raised_millions,638


In [None]:
query='''
 SELECT COUNT(*) ,industry from imdb.layoffs group by industry;
'''
cursor.execute(query)
for x in cursor:
  print(x)

(129, 'Other')
(95, 'Media')
(192, 'Retail')
(93, 'Education')
(117, 'Real Estate')
(145, 'Transportation')
(139, 'Marketing')
(3, '')
(183, 'Healthcare')
(74, 'Security')
(141, 'Food')
(31, 'Fitness')
(114, 'Consumer')
(42, 'Logistics')
(64, 'HR')
(43, 'Support')
(66, 'Travel')
(99, 'Crypto')
(284, 'Finance')
(79, 'Data')
(37, 'Sales')
(43, 'Infrastructure')
(17, 'Hardware')
(35, 'Product')
(16, 'Construction')
(13, 'Legal')
(12, 'Energy')
(1, 'NULL')
(2, 'Manufacturing')
(28, 'Recruiting')
(6, 'Aerospace')
(2, 'Crypto Currency')
(3, 'Fin-Tech')
(1, 'CryptoCurrency')


In [None]:
# Drain any unread rows left on the cursor by the previous query
cursor.fetchall()
# View the duplicates with a CTE
query='''
WITH DELETE_CTE AS
(
SELECT *
FROM (
	SELECT company, location, industry, total_laid_off,percentage_laid_off,date, stage, country, funds_raised_millions,
		ROW_NUMBER() OVER (
			PARTITION BY company, location, industry, total_laid_off,percentage_laid_off,date, stage, country, funds_raised_millions
			ORDER BY id) AS row_num
	FROM
		imdb.layoffs
) duplicates
WHERE
	row_num > 1
)
SELECT * FROM DELETE_CTE;
'''
cursor.execute(query)
for x in cursor:
  print(x)



('Casper', 'New York City', 'Retail', None, None, '9/14/2021', 'Post-IPO', 'United States', Decimal('339.00'), 2)
('Cazoo', 'London', 'Transportation', Decimal('750.00'), Decimal('0.15'), '6/7/2022', 'Post-IPO', 'United Kingdom', Decimal('2000.00'), 2)
('Hibob', 'Tel Aviv', 'HR', Decimal('70.00'), Decimal('0.30'), '3/30/2020', 'Series A', 'Israel', Decimal('45.00'), 2)
('Wildlife Studios', 'Sao Paulo', 'Consumer', Decimal('300.00'), Decimal('0.20'), '11/28/2022', 'Unknown', 'Brazil', Decimal('260.00'), 2)
('Yahoo', 'SF Bay Area', 'Consumer', Decimal('1600.00'), Decimal('0.20'), '2/9/2023', 'Acquired', 'United States', Decimal('6.00'), 2)


In [25]:
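# Drain any unread rows left on the cursor by the previous query
cursor.fetchall()
# Remove the duplicates from the table, keeping the row with the lowest id in each group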
query='''
WITH DELETE_CTE AS (
	SELECT id, company, location, industry, total_laid_off, percentage_laid_off, date, stage, country, funds_raised_millions,
    ROW_NUMBER() OVER (PARTITION BY company, location, industry, total_laid_off, percentage_laid_off, date, stage, country, funds_raised_millions ORDER BY id) AS row_num
	FROM imdb.layoffs
)
DELETE FROM imdb.layoffs
WHERE id IN (SELECT id FROM DELETE_CTE WHERE row_num > 1);
'''
cursor.execute(query)
conn.commit()
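
The cell above keeps the lowest id in each group of identical rows and deletes the rest. The same idea can be sketched without window functions by keeping `MIN(id)` per group. This is an alternative technique, not the notebook's query, and it runs here against an in-memory SQLite stand-in with a cut-down table and sample rows based on this dataset. In MySQL, a subquery that reads the table being deleted from would likely need to be wrapped in a derived table.

```python
import sqlite3

dedupe_db = sqlite3.connect(":memory:")  # stand-in for the MySQL connection
dedupe_db.execute(
    "CREATE TABLE layoffs (id INTEGER PRIMARY KEY, company TEXT, total_laid_off REAL, date TEXT)"
)
dedupe_db.executemany(
    "INSERT INTO layoffs (company, total_laid_off, date) VALUES (?, ?, ?)",
    [
        ("Cazoo", 750, "6/7/2022"),
        ("Casper", None, "9/14/2021"),
        ("Cazoo", 750, "6/7/2022"),     # duplicate of id 1
        ("Casper", None, "9/14/2021"),  # duplicate of id 2 (GROUP BY puts NULLs in one group)
        ("Casper", 78, "4/21/2020"),
    ],
)

# Keep the smallest id in each group of identical rows and delete the rest
dedupe_db.execute(
    """
    DELETE FROM layoffs
    WHERE id NOT IN (
        SELECT MIN(id) FROM layoffs GROUP BY company, total_laid_off, date
    )
    """
)
dedupe_db.commit()

remaining = dedupe_db.execute("SELECT id, company FROM layoffs ORDER BY id").fetchall()
print(remaining)
```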


In [26]:
# Check the table for duplicates after removing duplicates
query="""
SELECT *
FROM (
	SELECT company, location, industry, total_laid_off,percentage_laid_off,date, stage, country, funds_raised_millions,
		ROW_NUMBER() OVER (
			PARTITION BY company, location, industry, total_laid_off,percentage_laid_off,date, stage, country, funds_raised_millions
			ORDER BY id) AS row_num
	FROM
		imdb.layoffs
) duplicates
WHERE
	row_num > 1;
  """

cursor.execute(query)
for x in cursor:
  print(x)


In [27]:

#Checking row number for duplicated rows
query="""
SELECT * FROM imdb.layoffs
WHERE company in ('Casper','Cazoo','Hibob','Wildlife Studios','Yahoo');
"""
cursor.execute(query)
for x in cursor:
  print(x)

('Yahoo', 'SF Bay Area', 'Consumer', Decimal('1600.00'), Decimal('0.20'), '2/9/2023', 'Acquired', 'United States', Decimal('6.00'), 87)
('Wildlife Studios', 'Sao Paulo', 'Consumer', Decimal('300.00'), Decimal('0.20'), '11/28/2022', 'Unknown', 'Brazil', Decimal('260.00'), 307)
('Cazoo', 'London', 'Transportation', None, None, '1/18/2023', 'Post-IPO', 'United Kingdom', Decimal('2000.00'), 1181)
('Casper', 'New York City', 'Retail', None, None, '9/14/2021', 'Post-IPO', 'United States', Decimal('339.00'), 1479)
('Cazoo', 'London', 'Transportation', Decimal('750.00'), Decimal('0.15'), '6/7/2022', 'Post-IPO', 'United Kingdom', Decimal('2000.00'), 1606)
('Casper', 'New York City', 'Retail', Decimal('78.00'), Decimal('0.21'), '4/21/2020', 'Post-IPO', 'United States', Decimal('339.00'), 1762)
('Hibob', 'Tel Aviv', 'HR', Decimal('70.00'), Decimal('0.30'), '3/30/2020', 'Series A', 'Israel', Decimal('45.00'), 2133)


In [28]:
#Checking Number of rows after removing duplicates
cursor.execute("SELECT COUNT(*) FROM imdb.layoffs;")
for x in cursor:
  print(x)

print("Total  Number of Rows in table layoffs after removing duplicates are ",x[0])

(2356,)
Total  Number of Rows in table layoffs after removing duplicates are  2356


- Successfully removed duplicates.

Cross-checking with pandas

In [29]:
#Remove the duplicates with pandas drop_duplicates()
df.drop_duplicates(inplace=True)

In [30]:
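#Rows of df where company is one of 'Casper','Cazoo','Hibob','Wildlife Studios','Yahoo'
#value_counts() skips rows that contain NaN, so only fully populated rows are counted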
df[df['company'].isin(['Casper','Cazoo','Hibob','Wildlife Studios','Yahoo'])].value_counts()


Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,Unnamed: 4_level_0,Unnamed: 5_level_0,Unnamed: 6_level_0,Unnamed: 7_level_0,Unnamed: 8_level_0,count
company,location,industry,total_laid_off,percentage_laid_off,date,stage,country,funds_raised_millions,Unnamed: 9_level_1
Casper,New York City,Retail,78.0,0.21,4/21/2020,Post-IPO,United States,339.0,1
Cazoo,London,Transportation,750.0,0.15,6/7/2022,Post-IPO,United Kingdom,2000.0,1
Hibob,Tel Aviv,HR,70.0,0.3,3/30/2020,Series A,Israel,45.0,1
Wildlife Studios,Sao Paulo,Consumer,300.0,0.2,11/28/2022,Unknown,Brazil,260.0,1
Yahoo,SF Bay Area,Consumer,1600.0,0.2,2/9/2023,Acquired,United States,6.0,1


In [None]:
df[df['company'].isin(['Casper','Cazoo','Hibob','Wildlife Studios','Yahoo'])]

Unnamed: 0,company,location,industry,total_laid_off,percentage_laid_off,date,stage,country,funds_raised_millions
118,Yahoo,SF Bay Area,Consumer,1600.0,0.2,2/9/2023,Acquired,United States,6.0
332,Cazoo,London,Transportation,,,1/18/2023,Post-IPO,United Kingdom,2000.0
625,Wildlife Studios,Sao Paulo,Consumer,300.0,0.2,11/28/2022,Unknown,Brazil,260.0
1491,Cazoo,London,Transportation,750.0,0.15,6/7/2022,Post-IPO,United Kingdom,2000.0
1689,Casper,New York City,Retail,,,9/14/2021,Post-IPO,United States,339.0
2022,Casper,New York City,Retail,78.0,0.21,4/21/2020,Post-IPO,United States,339.0
2254,Hibob,Tel Aviv,HR,70.0,0.3,3/30/2020,Series A,Israel,45.0


In [31]:
#Checking for duplicate rows using duplicated().sum()
df.duplicated().sum()

np.int64(0)

In [32]:
#Checking the number of rows
df.shape

(2356, 9)

#### **Detecting Missing Values**

#### **What Are Missing Values?**
- Missing values are placeholders for data that is:
  - Not collected
  - Not applicable
  - Invalid or corrupted
  - Deliberately left blank

- They are usually represented as:
  - NULL (SQL)
  - NaN (Pandas, NumPy)
  - NA / empty strings ('' or ' ')
  - Placeholder values like -9999, 999, 'unknown', None, etc.
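
Because the same "missing" idea shows up in so many spellings, it helps to map them all to one representation before counting anything. The sketch below assumes hypothetical sentinel sets (`MISSING_STRINGS`, `MISSING_NUMBERS`) and a hypothetical helper `normalize_missing`. The right sentinels depend on the data source, so treat the lists as placeholders.

```python
# Hypothetical sentinel sets; adjust them to the placeholders your source really uses.
MISSING_STRINGS = {"", "null", "none", "na", "n/a", "nan", "unknown"}
MISSING_NUMBERS = {-9999, 999}


def normalize_missing(value):
    """Return None for anything treated as missing, otherwise the value unchanged."""
    if value is None:
        return None
    if isinstance(value, float) and value != value:  # NaN is the only float unequal to itself
        return None
    if isinstance(value, str):
        # Compare case-insensitively and ignore surrounding white space.
        return None if value.strip().lower() in MISSING_STRINGS else value
    if isinstance(value, (int, float)) and value in MISSING_NUMBERS:
        return None
    return value


raw = ["Retail", " ", "NULL", None, float("nan"), -9999, 42, "Unknown"]
cleaned = [normalize_missing(v) for v in raw]
print(cleaned)
```

Numeric sentinels such as 999 can also be legitimate values, and a label like 'Unknown' is sometimes a deliberate category, so both lists should be confirmed against the column's meaning before applying them.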



#### **Causes of Missing Values**
- Incomplete user input
  - e.g. A form was submitted with optional fields skipped
- Data corruption
  - e.g. Logging failure or packet loss
- Schema mismatch during merge
  - e.g. Field exists in one source but not another
- Bad joins in ETL
  - e.g. Foreign key not found in lookup table
- Time-series dropout
  - e.g. IoT sensor goes offline intermittently

#### **How to handle missing values?**
- **Detection**
```
SELECT * FROM table WHERE column IS NULL;
```
```
 # With pandas
 df.isnull().sum()            # number of missing values per column
 df[df.isnull().any(axis=1)]  # rows with at least one missing value
```
- **Handling Strategies**
  - Replace (Impute) NULLs
    1.  Replace Missing Numeric Values with 0
     ```
      UPDATE sales
     SET amount = 0
     WHERE amount IS NULL;
     ```
    2. Replace with the mean or median.
     ```
     UPDATE sales
     SET amount = (
       SELECT AVG(amount) FROM (SELECT amount FROM sales WHERE amount IS NOT NULL) AS t
     )
     WHERE amount IS NULL;
     ```
    3. Replace Categorical Values with 'Unknown' or the mode.
     ```
      UPDATE users
      SET city = 'Unknown'
      WHERE city IS NULL;
    ```
     - Find mode
     ```
     SELECT city
     FROM users
     WHERE city IS NOT NULL
     GROUP BY city
     ORDER BY COUNT(*) DESC
     LIMIT 1;
     ```
     ```
      UPDATE users
      SET city = 'mode_value'
      WHERE city IS NULL;
    ```
    4. Use COALESCE() in SELECT Queries
     ```
      SELECT id,
       COALESCE(city, 'Unknown') AS city_cleaned,
       COALESCE(amount, 0) AS amount_cleaned
      FROM users;
     ```
    5. Replace Missing Dates with a Default
     ```
      UPDATE events
      SET event_date = '2024-01-01'
      WHERE event_date IS NULL;
    ```
    ```
     UPDATE events
     SET event_date = CURRENT_DATE
     WHERE event_date IS NULL;
     ```


  - Drop Rows with NULL values
  - Aggregate with NULL Awareness
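   - Method to find the median (MySQL does not accept a subquery in LIMIT/OFFSET, so rank the rows with window functions instead)
    ```
    SELECT AVG(amount) AS median_amount
    FROM (
      SELECT amount,
             ROW_NUMBER() OVER (ORDER BY amount) AS rn,
             COUNT(*) OVER () AS cnt
      FROM sales
      WHERE amount IS NOT NULL
    ) AS ranked
    WHERE rn IN (FLOOR((cnt + 1) / 2), FLOOR((cnt + 2) / 2));
    ```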
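
The handling strategies above can be compared side by side on a toy table. This is a minimal sketch, assuming an in-memory SQLite database and a hypothetical `sales` table with made-up rows. The median is computed client-side with Python's `statistics.median` for brevity, which is a substitution and not the SQL approach listed above.

```python
import sqlite3
import statistics

# Assumption: SQLite stands in for MySQL; the table and values are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, city TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales (city, amount) VALUES (?, ?)",
    [("Pune", 10.0), ("Pune", 20.0), (None, 60.0), ("Boston", None), ("Pune", None)],
)

# COALESCE only changes what the query returns; the stored rows keep their NULLs.
view = conn.execute(
    "SELECT id, COALESCE(city, 'Unknown'), COALESCE(amount, 0) FROM sales ORDER BY id"
).fetchall()

# AVG() ignores NULLs, so the mean is taken over the three known amounts.
mean_amount = conn.execute("SELECT AVG(amount) FROM sales").fetchone()[0]

# Median of the known amounts, computed in Python.
known = [r[0] for r in conn.execute("SELECT amount FROM sales WHERE amount IS NOT NULL")]
median_amount = statistics.median(known)

# Mode: the most frequent non-NULL city.
mode_city = conn.execute("""
    SELECT city FROM sales
    WHERE city IS NOT NULL
    GROUP BY city
    ORDER BY COUNT(*) DESC
    LIMIT 1
""").fetchone()[0]

# Persist the imputations with parameterized UPDATE statements.
conn.execute("UPDATE sales SET amount = ? WHERE amount IS NULL", (median_amount,))
conn.execute("UPDATE sales SET city = ? WHERE city IS NULL", (mode_city,))
conn.commit()

remaining_nulls = conn.execute(
    "SELECT COUNT(*) FROM sales WHERE city IS NULL OR amount IS NULL"
).fetchone()[0]
print(mean_amount, median_amount, mode_city, remaining_nulls)
```

The mean (30.0) and the median (20.0) differ here because one large amount pulls the average up. That gap is the usual reason to prefer the median for skewed columns.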


**1. Checking Missing Values**

In [None]:
#checking missing values
query="""
SELECT *
FROM imdb.layoffs
WHERE company IS NULL OR location IS NULL OR industry IS NULL OR total_laid_off IS NULL OR percentage_laid_off IS NULL OR date IS NULL OR stage IS NULL OR country IS NULL OR funds_raised_millions IS NULL;
"""
cursor.execute(query)
for x in cursor:
  print(x)

('Alerzo', 'Ibadan', 'Retail', Decimal('400.00'), None, '3/6/2023', 'Series B', 'Nigeria', Decimal('16.00'), 3)
('UpGrad', 'Mumbai', 'Education', Decimal('120.00'), None, '3/6/2023', 'Unknown', 'India', Decimal('631.00'), 4)
('Lendi', 'Sydney', 'Real Estate', Decimal('100.00'), None, '3/3/2023', 'Unknown', 'Australia', Decimal('59.00'), 7)
('UserTesting', 'SF Bay Area', 'Marketing', Decimal('63.00'), None, '3/3/2023', 'Acquired', 'United States', Decimal('152.00'), 8)
('Airbnb', 'SF Bay Area', '', Decimal('30.00'), None, '3/3/2023', 'Post-IPO', 'United States', Decimal('6400.00'), 9)
('Accolade', 'Seattle', 'Healthcare', None, None, '3/3/2023', 'Post-IPO', 'United States', Decimal('458.00'), 10)
('Indigo', 'Boston', 'Other', None, None, '3/3/2023', 'Series F', 'United States.', Decimal('1200.00'), 11)
('MasterClass', 'SF Bay Area', 'Education', Decimal('79.00'), None, '3/2/2023', 'Series E', 'United States', Decimal('461.00'), 13)
('Ambev Tech', 'Blumenau', 'Food', Decimal('50.00'), No

In [None]:
#checking missing values
query="""
SELECT COUNT(*)
FROM imdb.layoffs
WHERE company IS NULL OR location IS NULL OR industry IS NULL OR total_laid_off IS NULL OR percentage_laid_off IS NULL OR date IS NULL OR stage IS NULL OR country IS NULL OR funds_raised_millions IS NULL;
"""
cursor.execute(query)
for x in cursor:
  print(x)

(1267,)


**Checking Null values with Pandas**

In [None]:
# checking missing values
df.isnull().sum()

Unnamed: 0,0
company,0
location,0
industry,4
total_laid_off,739
percentage_laid_off,784
date,1
stage,6
country,0
funds_raised_millions,209


In [None]:
#Total number of null cells across all columns (a row with several nulls is counted once per null)
df.isnull().sum().sum()

np.int64(1743)
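
The SQL count (1267) and the pandas total (1743) measure different things: the `WHERE ... OR ...` query counts rows that have at least one NULL, while `df.isnull().sum().sum()` counts individual null cells. A per-column count, the SQL counterpart of `df.isnull().sum()`, can be sketched as below. The example assumes an in-memory SQLite database and a hypothetical three-column table. `SUM(col IS NULL)` relies on the comparison evaluating to 0 or 1, which holds in both SQLite and MySQL.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL; the rows are made up.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE layoffs (company TEXT, total_laid_off REAL, percentage_laid_off REAL)"
)
conn.executemany("INSERT INTO layoffs VALUES (?, ?, ?)", [
    ("A", 100.0, None),
    ("B", None, None),
    ("C", 50.0, 0.1),
])

# Trusted, hard-coded column names (identifiers cannot be bound as parameters).
columns = ["company", "total_laid_off", "percentage_laid_off"]

# One SUM(col IS NULL) per column gives the null count per column.
select_list = ", ".join(f"SUM({c} IS NULL) AS {c}_nulls" for c in columns)
per_column = dict(zip(columns, conn.execute(f"SELECT {select_list} FROM layoffs").fetchone()))

# Rows with at least one NULL, matching the WHERE ... OR ... query style.
any_null = " OR ".join(f"{c} IS NULL" for c in columns)
rows_with_null = conn.execute(f"SELECT COUNT(*) FROM layoffs WHERE {any_null}").fetchone()[0]

total_null_cells = sum(per_column.values())
print(per_column, rows_with_null, total_null_cells)
```

In this toy table two rows contain a NULL but three cells are NULL, which mirrors the gap between the two totals.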

In [None]:
#checking missing values from industry
query="""
SELECT Count(*)
FROM imdb.layoffs
WHERE industry IS NULL OR industry='NULL' OR industry='';
"""
cursor.execute(query)
for x in cursor:
  print(x)


(4,)


In [None]:
#Checking missing value for total_laid_off
query="""
SELECT Count(*)
FROM imdb.layoffs
WHERE total_laid_off IS NULL OR total_laid_off='NULL' OR total_laid_off='';
"""
cursor.execute(query)
for x in cursor:
  print(x)

(739,)


In [None]:
#Checking missing value for total_laid_off
query="""
SELECT *
FROM imdb.layoffs
WHERE total_laid_off IS NULL limit 5;
"""
cursor.execute(query)
for x in cursor:
  print(x)

('Accolade', 'Seattle', 'Healthcare', None, None, '3/3/2023', 'Post-IPO', 'United States', Decimal('458.00'), 10)
('Indigo', 'Boston', 'Other', None, None, '3/3/2023', 'Series F', 'United States.', Decimal('1200.00'), 11)
('Flipkart', 'Bengaluru', 'Retail', None, None, '3/2/2023', 'Acquired', 'India', Decimal('12900.00'), 17)
('Kandela', 'Los Angeles', 'Consumer', None, Decimal('1.00'), '3/2/2023', 'Acquired', 'United States', None, 18)
('Truckstop.com', 'Boise', 'Logistics', None, None, '3/2/2023', 'Acquired', 'United States', None, 19)


In [None]:
#Checking Missing values for percentage_laid_off
query="""
SELECT Count(*)
FROM imdb.layoffs
WHERE percentage_laid_off IS NULL OR percentage_laid_off='NULL' OR percentage_laid_off='';
"""
cursor.execute(query)
for x in cursor:
  print(x)

(785,)


In [None]:
#View percentage_laid_off with pandas
df[df['percentage_laid_off'].isnull()]



Unnamed: 0,company,location,industry,total_laid_off,percentage_laid_off,date,stage,country,funds_raised_millions
2,Alerzo,Ibadan,Retail,400.0,,3/6/2023,Series B,Nigeria,16.0
3,UpGrad,Mumbai,Education,120.0,,3/6/2023,Unknown,India,631.0
6,Lendi,Sydney,Real Estate,100.0,,3/3/2023,Unknown,Australia,59.0
7,UserTesting,SF Bay Area,Marketing,63.0,,3/3/2023,Acquired,United States,152.0
8,Airbnb,SF Bay Area,,30.0,,3/3/2023,Post-IPO,United States,6400.0
...,...,...,...,...,...,...,...,...,...
2341,Bounce,Bengaluru,Transportation,120.0,,3/19/2020,Series D,India,214.0
2344,Lola,Boston,Travel,34.0,,3/19/2020,Series C,United States,81.0
2345,Anyvision,Tel Aviv,Security,,,3/19/2020,Series A,Israel,74.0
2347,Tuft & Needle,Phoenix,Retail,,,3/19/2020,Acquired,United States,0.0


In [None]:
# Missing value for date
query="""
SELECT Count(*)
FROM imdb.layoffs
WHERE date IS NULL OR date='NULL' OR date='';
"""
cursor.execute(query)
for x in cursor:
  print(x)

(1,)


In [None]:
# Missing value for date
query="""
SELECT *
FROM imdb.layoffs
WHERE date IS NULL OR date='NULL' OR date='';
"""
cursor.execute(query)
for x in cursor:
  print(x)

('Blackbaud', 'Charleston', 'Other', Decimal('500.00'), Decimal('0.14'), 'NULL', 'Post-IPO', 'United States', None, 2357)


In [None]:
# Crosscheck with pandas
df[df['date'].isnull()]

Unnamed: 0,company,location,industry,total_laid_off,percentage_laid_off,date,stage,country,funds_raised_millions
2356,Blackbaud,Charleston,Other,500.0,0.14,,Post-IPO,United States,


In [None]:
# View the Blackbaud record (its date is stored as the string 'NULL')
query="""
SELECT *
FROM imdb.layoffs
WHERE company='Blackbaud';
"""
cursor.execute(query)
for x in cursor:
  print(x)

('Blackbaud', 'Charleston', 'Other', Decimal('500.00'), Decimal('0.14'), 'NULL', 'Post-IPO', 'United States', None, 2357)
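
The Blackbaud row shows why `IS NULL` alone misses some gaps: the date is the four-character string 'NULL', not a real NULL. If `df` was loaded with `read_csv`, pandas treats that text as missing by default, which would explain the NaN on the pandas side. One way to align the database is to turn such text placeholders into real NULLs. This is a minimal sketch, assuming an in-memory SQLite database and a hypothetical cut-down table, not a step the notebook performs at this point.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL; rows mimic the placeholder patterns seen above.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE layoffs (id INTEGER PRIMARY KEY, company TEXT, industry TEXT, date TEXT)"
)
conn.executemany(
    "INSERT INTO layoffs (company, industry, date) VALUES (?, ?, ?)",
    [
        ("Blackbaud", "Other", "NULL"),      # text placeholder in date
        ("Airbnb", "", "3/3/2023"),          # empty string in industry
        ("Loft", "Real Estate", "3/3/2023"),
    ],
)

# Column names are fixed and trusted here; identifiers cannot be bound as parameters.
for column in ("industry", "date"):
    conn.execute(
        f"UPDATE layoffs SET {column} = NULL WHERE TRIM({column}) IN ('', 'NULL')"
    )
conn.commit()

result = conn.execute("SELECT company, industry, date FROM layoffs ORDER BY id").fetchall()
print(result)
```

After this step a plain `IS NULL` test finds every gap, so the extra `= 'NULL'` and `= ''` comparisons are no longer needed.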


In [None]:
#Missing value for stage
query="""
SELECT Count(*)
FROM imdb.layoffs
WHERE stage IS NULL OR stage='NULL' OR stage='';
"""
cursor.execute(query)
for x in cursor:
  print(x)

(6,)


In [None]:
#Missing value for funds_raised_millions
query="""
SELECT Count(*)
FROM imdb.layoffs
WHERE funds_raised_millions IS NULL OR funds_raised_millions='NULL' OR funds_raised_millions='';
"""
cursor.execute(query)
for x in cursor:
  print(x)

(216,)


In [None]:
#Missing value for funds_raised_millions
query="""
SELECT count(*)
FROM imdb.layoffs
WHERE funds_raised_millions IS NULL;
"""
cursor.execute(query)
for x in cursor:
  print(x)

(216,)


In [None]:
#Missing value for funds_raised_millions
query="""
SELECT *
FROM imdb.layoffs
WHERE funds_raised_millions IS NULL OR funds_raised_millions='NULL' OR funds_raised_millions='';
"""
cursor.execute(query)
for x in cursor:
  print(x)

('Ambev Tech', 'Blumenau', 'Food', Decimal('50.00'), None, '3/2/2023', 'Acquired', 'Brazil', None, 14)
('Kandela', 'Los Angeles', 'Consumer', None, Decimal('1.00'), '3/2/2023', 'Acquired', 'United States', None, 18)
('Truckstop.com', 'Boise', 'Logistics', None, None, '3/2/2023', 'Acquired', 'United States', None, 19)
('DUX Education', 'Bengaluru', 'Education', None, Decimal('1.00'), '2/28/2023', 'Unknown', 'India', None, 30)
('Merative', 'Ann Arbor', 'Healthcare', Decimal('200.00'), Decimal('0.10'), '2/23/2023', 'Acquired', 'United States', None, 56)
('The Iconic', 'Sydney', 'Retail', Decimal('69.00'), Decimal('0.06'), '2/23/2023', 'Unknown', 'Australia', None, 60)
('Vibrent Health', 'Washington D.C.', 'Healthcare', None, Decimal('0.13'), '2/23/2023', 'Unknown', 'United States', None, 70)
('Synamedia', 'London', 'Media', Decimal('200.00'), Decimal('0.12'), '2/22/2023', 'Unknown', 'United Kingdom', None, 71)
('CommerceHub', 'Albany', 'Retail', Decimal('371.00'), Decimal('0.31'), '2/14/2

In [None]:
df['funds_raised_millions'].value_counts(dropna=False)

Unnamed: 0_level_0,count
funds_raised_millions,Unnamed: 1_level_1
,209
2.0,22
1.0,21
20.0,19
11.0,18
...,...
574.0,1
179.1,1
296.0,1
359.0,1


In [None]:
#Count rows with at least one NULL value in MySQL
query="""
SELECT COUNT(*)
FROM imdb.layoffs
WHERE company IS NULL OR location IS NULL OR industry IS NULL OR total_laid_off IS NULL OR percentage_laid_off IS NULL OR date IS NULL OR stage IS NULL OR country IS NULL OR funds_raised_millions IS NULL;
"""
cursor.execute(query)
for x in cursor:
  print(x)

(1267,)


In [None]:
df[df['funds_raised_millions'].notnull()].tail(20)

Unnamed: 0,company,location,industry,total_laid_off,percentage_laid_off,date,stage,country,funds_raised_millions
2335,Flywheel Sports,New York City,Fitness,784.0,0.98,3/20/2020,Acquired,United States,120.0
2336,Peek,Salt Lake City,Travel,45.0,,3/20/2020,Series B,United States,39.0
2337,CTO.ai,Vancouver,Infrastructure,30.0,0.5,3/20/2020,Seed,Canada,7.0
2338,Yonder,Austin,Media,18.0,,3/20/2020,Series A,United States,16.0
2339,Service,Los Angeles,Travel,,1.0,3/20/2020,Seed,United States,5.0
2340,Vacasa,Portland,Travel,,,3/20/2020,Series C,United States,526.0
2341,Bounce,Bengaluru,Transportation,120.0,,3/19/2020,Series D,India,214.0
2343,Remote Year,Chicago,Travel,50.0,0.5,3/19/2020,Series B,United States,17.0
2344,Lola,Boston,Travel,34.0,,3/19/2020,Series C,United States,81.0
2345,Anyvision,Tel Aviv,Security,,,3/19/2020,Series A,Israel,74.0


In [None]:
len(df[df['funds_raised_millions']==0])

7

In [None]:
#View all rows that contain at least one NULL value
query="""
SELECT *
FROM imdb.layoffs
WHERE company IS NULL OR location IS NULL OR industry IS NULL OR total_laid_off IS NULL OR percentage_laid_off IS NULL OR date IS NULL OR stage IS NULL OR country IS NULL OR funds_raised_millions IS NULL;
"""
cursor.execute(query)
for x in cursor:
  print(x)

('Alerzo', 'Ibadan', 'Retail', Decimal('400.00'), None, '3/6/2023', 'Series B', 'Nigeria', Decimal('16.00'), 3)
('UpGrad', 'Mumbai', 'Education', Decimal('120.00'), None, '3/6/2023', 'Unknown', 'India', Decimal('631.00'), 4)
('Lendi', 'Sydney', 'Real Estate', Decimal('100.00'), None, '3/3/2023', 'Unknown', 'Australia', Decimal('59.00'), 7)
('UserTesting', 'SF Bay Area', 'Marketing', Decimal('63.00'), None, '3/3/2023', 'Acquired', 'United States', Decimal('152.00'), 8)
('Airbnb', 'SF Bay Area', '', Decimal('30.00'), None, '3/3/2023', 'Post-IPO', 'United States', Decimal('6400.00'), 9)
('Accolade', 'Seattle', 'Healthcare', None, None, '3/3/2023', 'Post-IPO', 'United States', Decimal('458.00'), 10)
('Indigo', 'Boston', 'Other', None, None, '3/3/2023', 'Series F', 'United States.', Decimal('1200.00'), 11)
('MasterClass', 'SF Bay Area', 'Education', Decimal('79.00'), None, '3/2/2023', 'Series E', 'United States', Decimal('461.00'), 13)
('Ambev Tech', 'Blumenau', 'Food', Decimal('50.00'), No

### **3. Standardization**

- Data standardization is a fundamental data-management step that ensures quality and consistency, making the data suitable for
  - Analysis,
  - Reporting, and
  - Decision-making
- Data that is not standardized is difficult to work with, error-prone, and less reliable.

#### **How to standardize data in SQL table?**
- Check that each column has an appropriate data type.
- Check for NULL or missing values, then replace or drop them based on the other column values.
- Check that the data formatting is consistent.
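
The three checks above are repeated for every column in the cells that follow, so they can be bundled into one helper. The sketch below assumes an in-memory SQLite database, where `PRAGMA table_info` plays the role of MySQL's `SHOW COLUMNS`, and a hypothetical function named `profile_column`. It is an illustration and not part of the notebook.

```python
import sqlite3


def profile_column(conn, table, column):
    """Return declared type, missing count, distinct count and untrimmed count."""
    # Identifiers cannot be bound as query parameters, so `table` must be trusted
    # and `column` is accepted only if it exists in the table's schema.
    declared = {row[1]: row[2] for row in conn.execute(f"PRAGMA table_info({table})")}
    if column not in declared:
        raise ValueError(f"unknown column: {column}")
    missing, distinct = conn.execute(
        f"SELECT SUM({column} IS NULL OR TRIM({column}) IN ('', 'NULL')), "
        f"COUNT(DISTINCT {column}) FROM {table}"
    ).fetchone()
    # Values that change under TRIM carry leading or trailing spaces.
    untrimmed = conn.execute(
        f"SELECT COUNT(*) FROM {table} WHERE {column} <> TRIM({column})"
    ).fetchone()[0]
    return {"type": declared[column], "missing": missing,
            "distinct": distinct, "untrimmed": untrimmed}


# Made-up rows that reproduce the kinds of problems found in this dataset.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE layoffs (company VARCHAR(255), industry VARCHAR(255))")
conn.executemany("INSERT INTO layoffs VALUES (?, ?)", [
    (" Loft", "Real Estate"),
    ("Loft", "Real Estate"),
    ("Airbnb", ""),
    ("Bally's Interactive", "NULL"),
    ("Indigo", None),
])
company_profile = profile_column(conn, "layoffs", "company")
industry_profile = profile_column(conn, "layoffs", "industry")
print(company_profile)
print(industry_profile)
```

A non-zero `untrimmed` count points to a `TRIM` update, and a non-zero `missing` count points to the imputation or lookup steps.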

**Take a look at each column's values, data type, and format**

In [None]:
# Consider column 'company'
query='''
  SELECT DISTINCT company
  FROM imdb.layoffs;
'''
cursor.execute(query)
for x in cursor:
  print(x)


('Atlassian',)
('SiriusXM',)
('Alerzo',)
('UpGrad',)
('Loft',)
('Embark Trucks',)
('Lendi',)
('UserTesting',)
('Airbnb',)
('Accolade',)
('Indigo',)
('Zscaler',)
('MasterClass',)
('Ambev Tech',)
('Fittr',)
('CNET',)
('Flipkart',)
('Kandela',)
('Truckstop.com',)
('Thoughtworks',)
('iFood',)
('Color Health',)
('Waymo',)
('PayFit',)
('Yellow.ai',)
('Sonder',)
('Protego Trust Bank',)
('Electronic Arts',)
('Eventbrite',)
('DUX Education',)
('MeridianLink',)
('Sono Motors',)
('Cerebral',)
('Amount',)
('Palantir',)
('Outreach',)
('Stytch',)
('BitSight',)
('Twitter',)
('Ericsson',)
('SAP Labs',)
('Velodyne Lidar',)
('Medallia',)
('Eat Just',)
('Lucira Health',)
('Stax',)
('Poshmark',)
('Merative',)
('OneFootball',)
('The Iconic',)
('EVgo',)
('StrongDM',)
('Dapper Labs',)
('Messari',)
('Vibrent Health',)
('Synamedia',)
('TaskUs',)
('Arch Oncology',)
('Immutable',)
('Jounce Therapeutics',)
('Locomation',)
('Polygon',)
('Crunchyroll',)
('Ethos Life',)
('Bolt',)
('Criteo',)
('Green Labs',)
('PeerSt

**Observations:**
- A few company names have leading white space; let's trim them so the names are uniform.

In [None]:
#Remove extra white spaces before and after company name
query='''
 SELECT DISTINCT TRIM(company) AS company
  FROM imdb.layoffs;
'''
cursor.execute(query)
for x in cursor:
  print(x)

('Atlassian',)
('SiriusXM',)
('Alerzo',)
('UpGrad',)
('Loft',)
('Embark Trucks',)
('Lendi',)
('UserTesting',)
('Airbnb',)
('Accolade',)
('Indigo',)
('Zscaler',)
('MasterClass',)
('Ambev Tech',)
('Fittr',)
('CNET',)
('Flipkart',)
('Kandela',)
('Truckstop.com',)
('Thoughtworks',)
('iFood',)
('Color Health',)
('Waymo',)
('PayFit',)
('Yellow.ai',)
('Sonder',)
('Protego Trust Bank',)
('Electronic Arts',)
('Eventbrite',)
('DUX Education',)
('MeridianLink',)
('Sono Motors',)
('Cerebral',)
('Amount',)
('Palantir',)
('Outreach',)
('Stytch',)
('BitSight',)
('Twitter',)
('Ericsson',)
('SAP Labs',)
('Velodyne Lidar',)
('Medallia',)
('Eat Just',)
('Lucira Health',)
('Stax',)
('Poshmark',)
('Merative',)
('OneFootball',)
('The Iconic',)
('EVgo',)
('StrongDM',)
('Dapper Labs',)
('Messari',)
('Vibrent Health',)
('Synamedia',)
('TaskUs',)
('Arch Oncology',)
('Immutable',)
('Jounce Therapeutics',)
('Locomation',)
('Polygon',)
('Crunchyroll',)
('Ethos Life',)
('Bolt',)
('Criteo',)
('Green Labs',)
('PeerSt

In [33]:
#Update company name with TRIM(company)
query='''
UPDATE imdb.layoffs
SET company = TRIM(company);
'''
cursor.execute(query)
conn.commit()

In [None]:
#Check its data type
query='''
SHOW COLUMNS FROM imdb.layoffs LIKE 'company';
'''
cursor.execute(query)
for x in cursor:
  print(x)

('company', 'varchar(255)', 'YES', '', None, '')


In [None]:
#Check any NULL value
query='''
SELECT Count(*)
FROM imdb.layoffs
WHERE company IS NULL OR company='';
'''
cursor.execute(query)
for x in cursor:
  print(x)
print("company has",x[0],"missing values.")

(0,)
company has 0 missing values.


In [None]:
#Checking for duplicates
query='''
SELECT company, COUNT(*)
FROM imdb.layoffs
GROUP BY company
HAVING COUNT(*) > 1;
'''
cursor.execute(query)
for x in cursor:
  print(x)

('Alerzo', 2)
('Loft', 6)
('DocuSign', 2)
('Airbnb', 2)
('The RealReal', 2)
('Indigo', 2)
('MasterClass', 2)
('Convoy', 4)
('Wix', 2)
('Neon', 2)
('Jellysmack', 2)
('Sprinklr', 3)
('Divvy Homes', 3)
('Momentive', 2)
('Sonder', 3)
('Eventbrite', 2)
('Cerebral', 3)
('Amount', 2)
('Outreach', 2)
('Twitter', 4)
('Twilio', 2)
('Electric', 2)
('Velodyne Lidar', 2)
('iRobot', 2)
('Foodpanda', 3)
('Getir', 3)
('LinkedIn', 2)
('TripleLift', 2)
('Nate', 3)
('OneFootball', 2)
('Yahoo', 2)
('Deliveroo', 2)
('Shutterfly', 2)
('Clear Capital', 2)
('Tier Mobility', 2)
('Noom', 3)
('Vacasa', 2)
('Innovaccer', 2)
('Dapper Labs', 2)
('TaskUs', 2)
('Arch Oncology', 2)
('Veriff', 2)
('Immutable', 2)
('GoDaddy', 2)
('Affirm', 2)
('Equitybee', 2)
('Ethos Life', 3)
('Bolt', 3)
('PeerStreet', 3)
('Tencent', 2)
('Chipper Cash', 2)
('Spotify', 3)
('Doma', 3)
('Intel', 2)
('Booktopia', 2)
('Weedmaps', 2)
('Namogoo', 2)
('Gemini', 3)
('Swyftx', 2)
('Wayfair', 2)
('Aqua Security', 2)
('Swiggy', 5)
('Elemy', 2)
('T

**Checking for location**

In [None]:
#Check data type of location
query='''
SHOW COLUMNS FROM imdb.layoffs LIKE 'location';
'''
cursor.execute(query)
for x in cursor:
  print(x)

('location', 'varchar(30)', 'YES', '', None, '')


In [None]:
#Check the formatting of location
query='''
SELECT DISTINCT location
FROM imdb.layoffs;
'''
cursor.execute(query)
for x in cursor:
  print(x)

('Sydney',)
('New York City',)
('Ibadan',)
('Mumbai',)
('Sao Paulo',)
('SF Bay Area',)
('Seattle',)
('Boston',)
('Blumenau',)
('Pune',)
('Atlanta',)
('Tel Aviv',)
('Bengaluru',)
('Los Angeles',)
('Boise',)
('Chicago',)
('Paris',)
('Baton Rouge',)
('Munich',)
('Denver',)
('Albany',)
('Stockholm',)
('Milan',)
('Orlando',)
('Singapore',)
('London',)
('Jakarta',)
('Walldorf',)
('Jersey City',)
('San Antonio',)
('Ann Arbor',)
('Berlin',)
('Philadelphia',)
('Columbus',)
('Reno',)
('Amsterdam',)
('Portland',)
('Vancouver',)
('Washington D.C.',)
('St. Louis',)
('Tallinn',)
('Pittsburgh',)
('Phoenix',)
('Tokyo',)
('Lagos',)
('Seoul',)
('Chennai',)
('Shenzen',)
('Toronto',)
('Kiel',)
('Austin',)
('Milwaukee',)
('Waterloo',)
('Salt Lake City',)
('Gurugram',)
('Brisbane',)
('Lehi',)
('Burlington',)
('Bend',)
('Miami',)
('Las Vegas',)
('Melbourne',)
('Dublin',)
('Nashville',)
('Sacramento',)
('Buenos Aires',)
('Charlotte',)
('Mexico City',)
('Barcelona',)
('Oxford',)
('Calgary',)
('Boulder',)
('Wil

In [None]:
#Missing values in location
query='''
SELECT Count(*)
FROM imdb.layoffs
WHERE location IS NULL OR location='';
'''
cursor.execute(query)
for x in cursor:
  print(x)
print("location has",x[0],"missing values.")

(0,)
location has 0 missing values.


In [None]:
#Count of duplicates
query='''
SELECT location, COUNT(*)
FROM imdb.layoffs
GROUP BY location
HAVING COUNT(*) > 1;
'''
cursor.execute(query)
for x in cursor:
  print(x)

('Sydney', 37)
('New York City', 249)
('Ibadan', 2)
('Mumbai', 26)
('Sao Paulo', 62)
('SF Bay Area', 614)
('Seattle', 79)
('Boston', 109)
('Blumenau', 2)
('Atlanta', 21)
('Tel Aviv', 56)
('Bengaluru', 81)
('Los Angeles', 93)
('Boise', 2)
('Chicago', 35)
('Paris', 6)
('Munich', 5)
('Denver', 15)
('Stockholm', 19)
('Singapore', 35)
('London', 76)
('Jakarta', 24)
('San Antonio', 2)
('Ann Arbor', 2)
('Berlin', 56)
('Philadelphia', 8)
('Columbus', 10)
('Reno', 4)
('Amsterdam', 12)
('Portland', 19)
('Vancouver', 19)
('Washington D.C.', 22)
('St. Louis', 8)
('Tallinn', 5)
('Pittsburgh', 6)
('Phoenix', 12)
('Tokyo', 2)
('Lagos', 13)
('Seoul', 2)
('Chennai', 3)
('Shenzen', 7)
('Toronto', 58)
('Austin', 39)
('Milwaukee', 2)
('Waterloo', 5)
('Salt Lake City', 23)
('Gurugram', 18)
('Brisbane', 7)
('Lehi', 5)
('Bend', 2)
('Miami', 13)
('Las Vegas', 5)
('Melbourne', 13)
('Dublin', 4)
('Nashville', 4)
('Sacramento', 3)
('Buenos Aires', 6)
('Charlotte', 4)
('Mexico City', 4)
('Barcelona', 3)
('Calgary

**Observations**
- 'location' has data type varchar(30), which is appropriate.
- It has no missing values.
- Many locations appear in multiple records.
- The data type and text format do not need to change.

**Checking industry**

In [None]:
#Checking data type of industry
query='''
SHOW COLUMNS FROM imdb.layoffs LIKE 'industry';
'''
cursor.execute(query)
for x in cursor:
  print(x)

('industry', 'varchar(255)', 'YES', '', None, '')


In [None]:
#Checking format of industry
query='''
SELECT DISTINCT industry
FROM imdb.layoffs;
'''
cursor.execute(query)
for x in cursor:
  print(x)

('Other',)
('Media',)
('Retail',)
('Education',)
('Real Estate',)
('Transportation',)
('Marketing',)
('',)
('Healthcare',)
('Security',)
('Food',)
('Fitness',)
('Consumer',)
('Logistics',)
('HR',)
('Support',)
('Travel',)
('Crypto',)
('Finance',)
('Sales',)
('Data',)
('Infrastructure',)
('Hardware',)
('Product',)
('Legal',)
('Construction',)
('Energy',)
('Manufacturing',)
('NULL',)
('Aerospace',)
('Recruiting',)
('Crypto Currency',)
('CryptoCurrency',)
('Fin-Tech',)


In [None]:
#Checking missing values for industry
query='''
SELECT Count(*)
FROM imdb.layoffs
WHERE industry IS NULL OR industry='' OR industry='NULL';
'''
cursor.execute(query)
for x in cursor:
  print(x)
print("industry has",x[0],"missing values.")

(4,)
industry has 4 missing values.


In [None]:
# Checking duplicates
query='''
SELECT industry, COUNT(*)
FROM imdb.layoffs
GROUP BY industry
HAVING COUNT(*) > 1;
'''
cursor.execute(query)
for x in cursor:
  print(x)

('Other', 129)
('Media', 95)
('Retail', 194)
('Education', 93)
('Real Estate', 117)
('Transportation', 147)
('Marketing', 139)
('', 3)
('Healthcare', 183)
('Security', 74)
('Food', 141)
('Fitness', 31)
('Consumer', 116)
('Logistics', 42)
('HR', 65)
('Support', 43)
('Travel', 66)
('Crypto', 99)
('Finance', 284)
('Data', 79)
('Sales', 37)
('Infrastructure', 43)
('Hardware', 17)
('Product', 35)
('Construction', 16)
('Legal', 13)
('Energy', 12)
('Manufacturing', 2)
('Recruiting', 28)
('Aerospace', 6)
('Crypto Currency', 2)
('Fin-Tech', 3)


**Observations**
- 'industry' has appropriate text formatting.
- It has 4 missing values, stored either as empty strings or as the string 'NULL'.
- The names 'Crypto', 'Crypto Currency', and 'CryptoCurrency' might refer to the same industry. Let's check those names and their data. We can update 'Crypto Currency' and 'CryptoCurrency' to 'Crypto'.

In [None]:
#Check records with industry 'Crypto Currency', 'CryptoCurrency', or 'Crypto'
query='''
SELECT *
FROM imdb.layoffs
WHERE industry='Crypto Currency' OR industry='CryptoCurrency' OR industry='Crypto';
'''
cursor.execute(query)
for x in cursor:
  print(x)

('Protego Trust Bank', 'Seattle', 'Crypto', None, Decimal('0.50'), '3/1/2023', 'Series A', 'United States', Decimal('70.00'), 27)
('Dapper Labs', 'Vancouver', 'Crypto', None, Decimal('0.20'), '2/23/2023', 'Series D', 'United States', Decimal('607.00'), 53)
('Messari', 'New York City', 'Crypto', None, Decimal('0.15'), '2/23/2023', 'Series B', 'United States', Decimal('61.00'), 54)
('Immutable', 'Sydney', 'Crypto', None, Decimal('0.11'), '2/22/2023', 'Series C', 'Australia', Decimal('279.00'), 59)
('Polygon', 'Bengaluru', 'Crypto', Decimal('100.00'), Decimal('0.20'), '2/21/2023', 'Unknown', 'India', Decimal('451.00'), 62)
('Fireblocks', 'New York City', 'Crypto', Decimal('30.00'), Decimal('0.05'), '2/20/2023', 'Series E', 'United States', Decimal('1000.00'), 71)
('Reserve', 'SF Bay Area', 'Crypto', None, None, '2/17/2023', 'Unknown', 'United States', None, 79)
('Magic Eden', 'SF Bay Area', 'Crypto', Decimal('22.00'), None, '2/13/2023', 'Series B', 'United States', Decimal('170.00'), 109)

In [34]:
#Update "CryptoCurrency" and "Crypto Currency" to "Crypto"
query='''
UPDATE imdb.layoffs
SET industry='Crypto'
WHERE industry='Crypto Currency' OR industry='CryptoCurrency';
'''
cursor.execute(query)
conn.commit()
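
When more than a couple of variant spellings need merging, the same update can be driven from a mapping instead of a hand-written `WHERE` clause. This is a minimal sketch, assuming an in-memory SQLite database and a hypothetical `CANONICAL` dictionary. With `mysql-connector-python` the placeholder is `%s`, not `?`.

```python
import sqlite3

# Hypothetical mapping of variant label -> canonical label.
CANONICAL = {
    "Crypto Currency": "Crypto",
    "CryptoCurrency": "Crypto",
}

# Assumption: SQLite stands in for MySQL; the rows are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE layoffs (company TEXT, industry TEXT)")
conn.executemany("INSERT INTO layoffs VALUES (?, ?)", [
    ("Polygon", "Crypto"),
    ("Gemini", "Crypto Currency"),
    ("WeTrade", "CryptoCurrency"),
    ("Affirm", "Finance"),
])

# One parameterized UPDATE per variant; values are bound, never concatenated into the SQL.
conn.executemany(
    "UPDATE layoffs SET industry = ? WHERE industry = ?",
    [(canonical, variant) for variant, canonical in CANONICAL.items()],
)
conn.commit()

labels = [r[0] for r in conn.execute("SELECT DISTINCT industry FROM layoffs ORDER BY industry")]
crypto_rows = conn.execute(
    "SELECT COUNT(*) FROM layoffs WHERE industry = 'Crypto'"
).fetchone()[0]
print(labels, crypto_rows)
```

Keeping the mapping in one place makes it easy to review which labels were merged and to extend the list later.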

In [35]:
#Check records with industry 'Crypto Currency', 'CryptoCurrency', or 'Crypto' after the update
query='''
SELECT *
FROM imdb.layoffs
WHERE industry='Crypto Currency' OR industry='CryptoCurrency' OR industry='Crypto' LIMIT 10;
'''
cursor.execute(query)
for x in cursor:
  print(x)


('Protego Trust Bank', 'Seattle', 'Crypto', None, Decimal('0.50'), '3/1/2023', 'Series A', 'United States', Decimal('70.00'), 47)
('Magic Eden', 'SF Bay Area', 'Crypto', Decimal('22.00'), None, '2/13/2023', 'Series B', 'United States', Decimal('170.00'), 69)
('WeTrade', 'Bengaluru', 'Crypto', None, Decimal('1.00'), '2/9/2023', 'Unknown', 'India', None, 99)
('Dapper Labs', 'Vancouver', 'Crypto', None, Decimal('0.20'), '2/23/2023', 'Series D', 'United States', Decimal('607.00'), 111)
('Messari', 'New York City', 'Crypto', None, Decimal('0.15'), '2/23/2023', 'Series B', 'United States', Decimal('61.00'), 117)
('Immutable', 'Sydney', 'Crypto', None, Decimal('0.11'), '2/22/2023', 'Series C', 'Australia', Decimal('279.00'), 127)
('Polygon', 'Bengaluru', 'Crypto', Decimal('100.00'), Decimal('0.20'), '2/21/2023', 'Unknown', 'India', Decimal('451.00'), 133)
('Protocol Labs', 'SF Bay Area', 'Crypto', Decimal('89.00'), Decimal('0.20'), '2/3/2023', 'Unknown', 'United States', Decimal('10.00'), 151

In [36]:
#View distinct industry
query='''
SELECT DISTINCT industry
FROM imdb.layoffs
WHERE industry IN('Crypto','CryptoCurrency','Crypto Currency');
'''
cursor.execute(query)
for x in cursor:
  print(x)

('Crypto',)


In [None]:
#View distinct industry
query='''
SELECT DISTINCT industry
FROM imdb.layoffs ;
'''
cursor.execute(query)
for x in cursor:
  print(x)

('Other',)
('Media',)
('Retail',)
('Education',)
('Real Estate',)
('Transportation',)
('Marketing',)
('',)
('Healthcare',)
('Security',)
('Food',)
('Fitness',)
('Consumer',)
('Logistics',)
('HR',)
('Support',)
('Travel',)
('Crypto',)
('Finance',)
('Sales',)
('Data',)
('Infrastructure',)
('Hardware',)
('Product',)
('Legal',)
('Construction',)
('Energy',)
('Manufacturing',)
('NULL',)
('Aerospace',)
('Recruiting',)
('Fin-Tech',)


In [None]:
# Check NULL and "" values and update with appropriate industry
query='''
SELECT *
FROM imdb.layoffs
WHERE industry IS NULL OR industry='' OR industry='NULL';
'''
cursor.execute(query)
for x in cursor:
  print(x)

('Airbnb', 'SF Bay Area', '', Decimal('30.00'), None, '3/3/2023', 'Post-IPO', 'United States', Decimal('6400.00'), 9)
("Bally's Interactive", 'Providence', 'NULL', None, Decimal('0.15'), '1/18/2023', 'Post-IPO', 'United States', Decimal('946.00'), 237)
('Juul', 'SF Bay Area', '', Decimal('400.00'), Decimal('0.30'), '11/10/2022', 'Unknown', 'United States', Decimal('1500.00'), 501)
('Carvana', 'Phoenix', '', Decimal('2500.00'), Decimal('0.12'), '5/10/2022', 'Post-IPO', 'United States', Decimal('1600.00'), 1492)


**Observations**:
- 'Airbnb' is a travel company whose industry is stored as ''; we can update it to 'Travel'.
- 'Bally's Interactive' is in the gambling and casino industry, which can be categorized as 'Entertainment'.
- 'Juul' manufactures e-cigarettes, which can be categorized as 'Consumer'.
- 'Carvana' is an auto and truck dealership, so we can update it to 'Transportation'.

In [None]:
#View the record with company 'Airbnb'
query='''
SELECT *
FROM imdb.layoffs
WHERE company='Airbnb';
'''
cursor.execute(query)
for x in cursor:
  print(x)

('Airbnb', 'SF Bay Area', '', Decimal('30.00'), None, '3/3/2023', 'Post-IPO', 'United States', Decimal('6400.00'), 9)
('Airbnb', 'SF Bay Area', 'Travel', Decimal('1900.00'), Decimal('0.25'), '5/5/2020', 'Private Equity', 'United States', Decimal('5400.00'), 1852)


In [39]:
#Update the industry of 'Airbnb' to 'Travel'
query='''
UPDATE imdb.layoffs
SET industry='Travel'
WHERE company='Airbnb';
'''

cursor.execute(query)
conn.commit()


In [40]:
#View the record with company 'Airbnb' after update
query='''
SELECT *
FROM imdb.layoffs
WHERE company='Airbnb';
'''
cursor.execute(query)
for x in cursor:
  print(x)

('Airbnb', 'SF Bay Area', 'Travel', Decimal('30.00'), None, '3/3/2023', 'Post-IPO', 'United States', Decimal('6400.00'), 17)
('Airbnb', 'SF Bay Area', 'Travel', Decimal('1900.00'), Decimal('0.25'), '5/5/2020', 'Private Equity', 'United States', Decimal('5400.00'), 1859)


In [None]:
#View the records for company 'Bally's Interactive', 'Juul' or 'Carvana'
query='''
SELECT * FROM imdb.layoffs WHERE company='Bally''s Interactive' OR company='Juul' OR company='Carvana';
'''
cursor.execute(query)
for x in cursor:
  print(x)

("Bally's Interactive", 'Providence', 'NULL', None, Decimal('0.15'), '1/18/2023', 'Post-IPO', 'United States', Decimal('946.00'), 331)
('Carvana', 'Phoenix', 'Transportation', None, None, '1/13/2023', 'Post-IPO', 'United States', Decimal('1600.00'), 372)
('Carvana', 'Phoenix', 'Transportation', Decimal('1500.00'), Decimal('0.08'), '11/18/2022', 'Post-IPO', 'United States', Decimal('1600.00'), 660)
('Juul', 'SF Bay Area', '', Decimal('400.00'), Decimal('0.30'), '11/10/2022', 'Unknown', 'United States', Decimal('1500.00'), 737)
('Carvana', 'Phoenix', '', Decimal('2500.00'), Decimal('0.12'), '5/10/2022', 'Post-IPO', 'United States', Decimal('1600.00'), 1596)
('Juul', 'SF Bay Area', 'Consumer', Decimal('900.00'), Decimal('0.30'), '5/5/2020', 'Unknown', 'United States', Decimal('1500.00'), 1853)


In [37]:
#Update the industry of 'Bally's Interactive' to 'Entertainment'
query='''
UPDATE imdb.layoffs
SET industry='Entertainment'
WHERE company='Bally''s Interactive';
'''
cursor.execute(query)
conn.commit()

In [38]:
#Check Bally's Interactive's updated record
query='''
SELECT * FROM imdb.layoffs WHERE company='Bally''s Interactive';
'''
cursor.execute(query)
for x in cursor:
  print(x)

("Bally's Interactive", 'Providence', 'Entertainment', None, Decimal('0.15'), '1/18/2023', 'Post-IPO', 'United States', Decimal('946.00'), 1177)


In [41]:
#Update the industry of 'Juul' to 'Consumer' and of 'Carvana' to 'Transportation'
query='''
 UPDATE imdb.layoffs
 SET industry='Consumer'
 WHERE company='Juul';
 '''
cursor.execute(query)
conn.commit()
query='''
 UPDATE imdb.layoffs
 SET industry='Transportation'
 WHERE company='Carvana';
 '''
cursor.execute(query)
conn.commit()

In [42]:
#Check for any remaining NULL, '' or 'NULL' industry values
query='''
SELECT *
FROM imdb.layoffs
WHERE industry IS NULL OR industry='' OR industry='NULL';
'''
cursor.execute(query)
for x in cursor:
  print(x)

In [43]:
#Check the updated fields
query='''
 SELECT * FROM imdb.layoffs WHERE industry IN('Travel','Entertainment','Consumer','Transportation') AND company IN('Airbnb','Bally''s Interactive','Juul','Carvana');
'''
cursor.execute(query)
for x in cursor:
  print(x)

('Airbnb', 'SF Bay Area', 'Travel', Decimal('30.00'), None, '3/3/2023', 'Post-IPO', 'United States', Decimal('6400.00'), 17)
('Carvana', 'Phoenix', 'Transportation', Decimal('1500.00'), Decimal('0.08'), '11/18/2022', 'Post-IPO', 'United States', Decimal('1600.00'), 351)
('Juul', 'SF Bay Area', 'Consumer', Decimal('400.00'), Decimal('0.30'), '11/10/2022', 'Unknown', 'United States', Decimal('1500.00'), 384)
("Bally's Interactive", 'Providence', 'Entertainment', None, Decimal('0.15'), '1/18/2023', 'Post-IPO', 'United States', Decimal('946.00'), 1177)
('Carvana', 'Phoenix', 'Transportation', None, None, '1/13/2023', 'Post-IPO', 'United States', Decimal('1600.00'), 1350)
('Carvana', 'Phoenix', 'Transportation', Decimal('2500.00'), Decimal('0.12'), '5/10/2022', 'Post-IPO', 'United States', Decimal('1600.00'), 1487)
('Airbnb', 'SF Bay Area', 'Travel', Decimal('1900.00'), Decimal('0.25'), '5/5/2020', 'Private Equity', 'United States', Decimal('5400.00'), 1859)
('Juul', 'SF Bay Area', 'Consume

In [None]:
#Show the records for company Carvana in the original DataFrame
df[df['company']=='Carvana']

Unnamed: 0,company,location,industry,total_laid_off,percentage_laid_off,date,stage,country,funds_raised_millions
371,Carvana,Phoenix,Transportation,,,1/13/2023,Post-IPO,United States,1600.0
659,Carvana,Phoenix,Transportation,1500.0,0.08,11/18/2022,Post-IPO,United States,1600.0
1595,Carvana,Phoenix,,2500.0,0.12,5/10/2022,Post-IPO,United States,1600.0


In [None]:
df[df['company']=='Juul']

Unnamed: 0,company,location,industry,total_laid_off,percentage_laid_off,date,stage,country,funds_raised_millions
736,Juul,SF Bay Area,,400.0,0.3,11/10/2022,Unknown,United States,1500.0
1937,Juul,SF Bay Area,Consumer,900.0,0.3,5/5/2020,Unknown,United States,1500.0


**Checking total_laid_off**

In [None]:
#Checking data type
query='''
SHOW COLUMNS FROM imdb.layoffs LIKE 'total_laid_off';
'''
cursor.execute(query)
for x in cursor:
  print(x)

('total_laid_off', 'decimal(10,2)', 'YES', '', None, '')


In [44]:
#Checking missing values
query='''
SELECT Count(*)
FROM imdb.layoffs
WHERE total_laid_off IS NULL OR total_laid_off='' OR total_laid_off='NULL';
'''
cursor.execute(query)
for x in cursor:
  print(x)
print("total_laid_off has",x[0],"missing values.")

(739,)
total_laid_off has 739 missing values.
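
The same missing-value count is repeated for several columns in this notebook, so a small helper can reduce the duplication. The sketch below is an illustration under stated assumptions: `count_missing` and `ALLOWED_COLUMNS` are hypothetical names, the data is a tiny in-memory `sqlite3` table rather than the real `imdb.layoffs`, and the example uses text columns, where the `''` and `'NULL'` comparisons are most meaningful.

```python
import sqlite3

# Columns the helper is allowed to inspect. Identifiers cannot be bound as
# query parameters, so they are checked against this whitelist instead.
ALLOWED_COLUMNS = {"industry", "stage", "country"}


def count_missing(cursor, column):
    """Count rows where `column` is NULL, empty, or the literal text 'NULL'."""
    if column not in ALLOWED_COLUMNS:
        raise ValueError(f"unexpected column name: {column!r}")
    cursor.execute(
        f"SELECT COUNT(*) FROM layoffs "
        f"WHERE {column} IS NULL OR {column} = '' OR {column} = 'NULL'"
    )
    return cursor.fetchone()[0]


# Stand-in data (assumption): three illustrative rows, not the real dataset.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE layoffs (industry TEXT, stage TEXT, country TEXT)")
cur.executemany(
    "INSERT INTO layoffs VALUES (?, ?, ?)",
    [
        ("Travel", "Post-IPO", "United States"),
        ("", "NULL", "India"),
        (None, "Series B", "Brazil"),
    ],
)

missing = {col: count_missing(cur, col) for col in sorted(ALLOWED_COLUMNS)}
for col, n in missing.items():
    print(col, "has", n, "missing values.")
```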


In [None]:
#Viewing missing values
query='''
SELECT *
FROM imdb.layoffs
WHERE total_laid_off IS NULL OR total_laid_off='' OR total_laid_off='NULL' LIMIT 20;
'''
cursor.execute(query)
for x in cursor:
  print(x)

('Accolade', 'Seattle', 'Healthcare', None, None, '3/3/2023', 'Post-IPO', 'United States', Decimal('458.00'), 10)
('Indigo', 'Boston', 'Other', None, None, '3/3/2023', 'Series F', 'United States.', Decimal('1200.00'), 11)
('Flipkart', 'Bengaluru', 'Retail', None, None, '3/2/2023', 'Acquired', 'India', Decimal('12900.00'), 17)
('Kandela', 'Los Angeles', 'Consumer', None, Decimal('1.00'), '3/2/2023', 'Acquired', 'United States', None, 18)
('Truckstop.com', 'Boise', 'Logistics', None, None, '3/2/2023', 'Acquired', 'United States', None, 19)
('Protego Trust Bank', 'Seattle', 'Crypto', None, Decimal('0.50'), '3/1/2023', 'Series A', 'United States', Decimal('70.00'), 27)
('DUX Education', 'Bengaluru', 'Education', None, Decimal('1.00'), '2/28/2023', 'Unknown', 'India', None, 30)
('MeridianLink', 'Los Angeles', 'Finance', None, Decimal('0.09'), '2/28/2023', 'Post-IPO', 'United States', Decimal('485.00'), 31)
('Poshmark', 'SF Bay Area', 'Retail', None, Decimal('0.02'), '2/24/2023', 'Acquired',

In [45]:
#Calculate the average total_laid_off by industry
query='''
SELECT industry, ROUND(AVG(total_laid_off)) AS avg_laid_off
FROM imdb.layoffs
GROUP BY industry;
'''
cursor.execute(query)
for x in cursor:
  print(x)

('Other', Decimal('459'))
('Sales', Decimal('389'))
('Media', Decimal('82'))
('Retail', Decimal('330'))
('Education', Decimal('199'))
('Real Estate', Decimal('231'))
('Logistics', Decimal('134'))
('Transportation', Decimal('310'))
('Marketing', Decimal('100'))
('Finance', Decimal('142'))
('Travel', Decimal('357'))
('Healthcare', Decimal('216'))
('Security', Decimal('107'))
('Infrastructure', Decimal('252'))
('Food', Decimal('243'))
('Support', Decimal('114'))
('Fitness', Decimal('417'))
('Consumer', Decimal('525'))
('HR', Decimal('68'))
('Crypto', Decimal('175'))
('Data', Decimal('103'))
('Hardware', Decimal('1383'))
('Product', Decimal('51'))
('Construction', Decimal('351'))
('Legal', Decimal('84'))
('Recruiting', Decimal('121'))
('Energy', Decimal('134'))
('Aerospace', Decimal('165'))
('Fin-Tech', Decimal('72'))
('Entertainment', None)
('Manufacturing', Decimal('20'))


In [None]:
query='''
SELECT company,industry,total_laid_off
FROM imdb.layoffs LIMIT 20
;
'''
cursor.execute(query)
for x in cursor:
  print(x)

('Atlassian', 'Other', Decimal('500.00'))
('SiriusXM', 'Media', Decimal('475.00'))
('Alerzo', 'Retail', Decimal('400.00'))
('UpGrad', 'Education', Decimal('120.00'))
('Loft', 'Real Estate', Decimal('340.00'))
('Embark Trucks', 'Transportation', Decimal('230.00'))
('Lendi', 'Real Estate', Decimal('100.00'))
('UserTesting', 'Marketing', Decimal('63.00'))
('Airbnb', 'Travel', Decimal('30.00'))
('Accolade', 'Healthcare', None)
('Indigo', 'Other', None)
('Zscaler', 'Security', Decimal('177.00'))
('MasterClass', 'Education', Decimal('79.00'))
('Ambev Tech', 'Food', Decimal('50.00'))
('Fittr', 'Fitness', Decimal('30.00'))
('CNET', 'Media', Decimal('12.00'))
('Flipkart', 'Retail', None)
('Kandela', 'Consumer', None)
('Truckstop.com', 'Logistics', None)
('Thoughtworks', 'Other', Decimal('500.00'))


In [55]:
conn.rollback()

In [57]:
#Update the total_laid_off by mean value with group by industry
query='''
 UPDATE imdb.layoffs AS l
JOIN (
  SELECT industry, AVG(total_laid_off) AS avg_layoffs
  FROM imdb.layoffs
  WHERE total_laid_off IS NOT NULL
  GROUP BY industry
) AS industry_avg
  ON l.industry = industry_avg.industry
SET l.total_laid_off = industry_avg.avg_layoffs
WHERE l.total_laid_off IS NULL;
'''
cursor.execute(query)
conn.commit()


In [58]:
#View the data after imputing missing total_laid_off with the industry mean
query='''
SELECT *
FROM imdb.layoffs
LIMIT 20;
'''
cursor.execute(query)
for x in cursor:
  print(x)

('Atlassian', 'Sydney', 'Other', Decimal('500.00'), Decimal('0.05'), '3/6/2023', 'Post-IPO', 'Australia', Decimal('210.00'), 1)
('DocuSign', 'SF Bay Area', 'Sales', Decimal('680.00'), Decimal('0.10'), '2/16/2023', 'Post-IPO', 'United States', Decimal('536.00'), 2)
('Pico Interactive', 'SF Bay Area', 'Other', Decimal('400.00'), Decimal('0.20'), '2/16/2023', 'Acquired', 'United States', Decimal('62.00'), 3)
('SiriusXM', 'New York City', 'Media', Decimal('475.00'), Decimal('0.08'), '3/6/2023', 'Post-IPO', 'United States', Decimal('525.00'), 4)
('Alerzo', 'Ibadan', 'Retail', Decimal('400.00'), Decimal('0.27'), '3/6/2023', 'Series B', 'Nigeria', Decimal('16.00'), 5)
('The RealReal', 'SF Bay Area', 'Retail', Decimal('230.00'), Decimal('0.07'), '2/16/2023', 'Post-IPO', 'United States', Decimal('356.00'), 6)
('UpGrad', 'Mumbai', 'Education', Decimal('120.00'), Decimal('0.36'), '3/6/2023', 'Unknown', 'India', Decimal('631.00'), 7)
('Smartsheet', 'Seattle', 'Other', Decimal('85.00'), Decimal('0.

In [None]:
df[df['company']=='Alerzo']

Unnamed: 0,company,location,industry,total_laid_off,percentage_laid_off,date,stage,country,funds_raised_millions
2,Alerzo,Ibadan,Retail,400.0,,3/6/2023,Series B,Nigeria,16.0
1016,Alerzo,Ibadan,Retail,,,9/2/2022,Series B,Nigeria,16.0


In [48]:
#View the updated records after mean imputation by industry
query='''
SELECT *
FROM imdb.layoffs
WHERE company='Alerzo';
'''
cursor.execute(query)
for x in cursor:
  print(x)

('Alerzo', 'Ibadan', 'Retail', Decimal('400.00'), None, '3/6/2023', 'Series B', 'Nigeria', Decimal('16.00'), 5)
('Alerzo', 'Ibadan', 'Retail', Decimal('330.00'), None, '9/2/2022', 'Series B', 'Nigeria', Decimal('16.00'), 797)
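
The group-wise mean imputation can also be split into two explicit steps, which makes the intermediate averages easy to inspect. This is a hedged sketch, not the notebook's implementation: it runs against an in-memory `sqlite3` table with made-up rows, and the name `industry_means` is an assumption. It also shows why an industry with no known values stays NULL, as 'Entertainment' does in the averages printed earlier.

```python
import sqlite3

# Stand-in data (assumption): a few illustrative rows in an in-memory table.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute(
    "CREATE TABLE layoffs (company TEXT, industry TEXT, total_laid_off REAL)"
)
cur.executemany(
    "INSERT INTO layoffs VALUES (?, ?, ?)",
    [
        ("A", "Retail", 400.0),
        ("B", "Retail", 200.0),
        ("C", "Retail", None),
        ("D", "Media", 50.0),
        ("E", "Media", None),
        ("F", "Legal", None),  # no known value exists for this industry
    ],
)

# Step 1: per-industry mean of the known values.
cur.execute(
    "SELECT industry, AVG(total_laid_off) FROM layoffs "
    "WHERE total_laid_off IS NOT NULL GROUP BY industry"
)
industry_means = dict(cur.fetchall())

# Step 2: fill only the rows that are still NULL, one industry at a time.
cur.executemany(
    "UPDATE layoffs SET total_laid_off = ? "
    "WHERE industry = ? AND total_laid_off IS NULL",
    [(round(mean), industry) for industry, mean in industry_means.items()],
)
conn.commit()

filled = dict(cur.execute("SELECT company, total_laid_off FROM layoffs").fetchall())
print(filled)
```

Row `F` keeps its NULL because 'Legal' has no known value to average, so such rows still need a separate decision.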


**Checking percentage_laid_off**

In [59]:
#checking percentage_laid_off
query='''
 SELECT * FROM imdb.layoffs
 WHERE percentage_laid_off IS NULL OR percentage_laid_off='' OR
 percentage_laid_off='NULL' LIMIT 20;
'''
cursor.execute(query)
for x in cursor:
  print(x)


('TaskUs', 'Los Angeles', 'Support', Decimal('52.00'), Decimal('0.00'), '6/21/2022', 'Post-IPO', 'United States', Decimal('279.00'), 1136)


In [66]:
conn.rollback()

In [71]:
#Update percentage_laid_off by industry
query='''
UPDATE imdb.layoffs AS l
JOIN (
  SELECT industry, AVG(percentage_laid_off) AS avg_percentage
  FROM imdb.layoffs
  WHERE percentage_laid_off IS NOT NULL AND percentage_laid_off <> 0
  GROUP BY industry
) AS industry_avg
  ON l.industry = industry_avg.industry
SET l.percentage_laid_off = industry_avg.avg_percentage
WHERE l.percentage_laid_off IS NULL OR l.percentage_laid_off = 0;
'''
cursor.execute(query)
conn.commit()

In [61]:
#checking updated percentage_laid_off
query='''
 SELECT * FROM imdb.layoffs
 LIMIT 20;
'''
cursor.execute(query)
for x in cursor:
  print(x)


('Atlassian', 'Sydney', 'Other', Decimal('500.00'), Decimal('0.05'), '3/6/2023', 'Post-IPO', 'Australia', Decimal('210.00'), 1)
('DocuSign', 'SF Bay Area', 'Sales', Decimal('680.00'), Decimal('0.10'), '2/16/2023', 'Post-IPO', 'United States', Decimal('536.00'), 2)
('Pico Interactive', 'SF Bay Area', 'Other', Decimal('400.00'), Decimal('0.20'), '2/16/2023', 'Acquired', 'United States', Decimal('62.00'), 3)
('SiriusXM', 'New York City', 'Media', Decimal('475.00'), Decimal('0.08'), '3/6/2023', 'Post-IPO', 'United States', Decimal('525.00'), 4)
('Alerzo', 'Ibadan', 'Retail', Decimal('400.00'), Decimal('0.27'), '3/6/2023', 'Series B', 'Nigeria', Decimal('16.00'), 5)
('The RealReal', 'SF Bay Area', 'Retail', Decimal('230.00'), Decimal('0.07'), '2/16/2023', 'Post-IPO', 'United States', Decimal('356.00'), 6)
('UpGrad', 'Mumbai', 'Education', Decimal('120.00'), Decimal('0.36'), '3/6/2023', 'Unknown', 'India', Decimal('631.00'), 7)
('Smartsheet', 'Seattle', 'Other', Decimal('85.00'), Decimal('0.

In [72]:
#Checking count of missing value
query='''
SELECT Count(*)
FROM imdb.layoffs
WHERE percentage_laid_off IS NULL OR percentage_laid_off='' OR percentage_laid_off='NULL';
'''
cursor.execute(query)
for x in cursor:
  print(x)
print("percentage_laid_off has",x[0],"missing values.")

(0,)
percentage_laid_off has 0 missing values.


**Note: Generally, these values can also be imputed during the EDA phase.**

**Checking date**

In [74]:
#Checking data type of date
query='''
SHOW COLUMNS FROM imdb.layoffs LIKE 'date';
'''
cursor.execute(query)
for x in cursor:
  print(x)

('date', 'varchar(30)', 'YES', '', None, '')


**Observations**:
- Data type of date is varchar(30).
- Let's convert it to the DATE type.
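
Before changing the column type, it can help to check which strings would actually parse as dates. The sketch below is an assumption-laden illustration in plain Python: `parse_us_date` is a hypothetical helper, and the sample values only mimic the `M/D/YYYY` strings and the 'NULL' marker seen in this dataset.

```python
from datetime import date, datetime
from typing import Optional


def parse_us_date(text: Optional[str]) -> Optional[date]:
    """Return a date for 'M/D/YYYY' strings, or None if missing or unparseable."""
    if text is None or text.strip() in ("", "NULL"):
        return None
    try:
        return datetime.strptime(text.strip(), "%m/%d/%Y").date()
    except ValueError:
        # Any other layout is reported as None so it can be reviewed by hand.
        return None


samples = ["3/6/2023", "11/10/2022", "NULL", "", None, "2023-03-06"]
parsed = [parse_us_date(s) for s in samples]
for raw, value in zip(samples, parsed):
    print(repr(raw), "->", value)
```

Values that come back as `None` are the ones that would end up NULL, or fail, during the SQL conversion.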

In [75]:
#View date values
query='''
SELECT DISTINCT date
FROM imdb.layoffs;
'''
cursor.execute(query)
for x in cursor:
  print(x)

('3/6/2023',)
('2/16/2023',)
('3/3/2023',)
('2/15/2023',)
('3/2/2023',)
('3/1/2023',)
('2/14/2023',)
('2/28/2023',)
('2/27/2023',)
('2/13/2023',)
('2/26/2023',)
('2/25/2023',)
('2/24/2023',)
('2/12/2023',)
('2/10/2023',)
('2/9/2023',)
('2/8/2023',)
('2/23/2023',)
('2/7/2023',)
('2/22/2023',)
('2/6/2023',)
('2/21/2023',)
('2/5/2023',)
('2/3/2023',)
('2/20/2023',)
('2/2/2023',)
('2/19/2023',)
('2/17/2023',)
('12/7/2022',)
('2/1/2023',)
('12/6/2022',)
('12/5/2022',)
('1/31/2023',)
('12/3/2022',)
('12/2/2022',)
('12/1/2022',)
('11/30/2022',)
('1/30/2023',)
('11/29/2022',)
('1/29/2023',)
('1/28/2023',)
('1/27/2023',)
('11/28/2022',)
('11/27/2022',)
('11/26/2022',)
('11/25/2022',)
('11/24/2022',)
('11/23/2022',)
('11/22/2022',)
('11/21/2022',)
('11/15/2022',)
('11/14/2022',)
('11/19/2022',)
('11/18/2022',)
('11/17/2022',)
('11/11/2022',)
('11/10/2022',)
('11/9/2022',)
('11/16/2022',)
('11/8/2022',)
('11/7/2022',)
('11/6/2022',)
('11/4/2022',)
('11/3/2022',)
('10/19/2022',)
('10/18/2022',)
('

In [76]:
query='''
SELECT * FROM imdb.layoffs
LIMIT 10;
'''
cursor.execute(query)
for x in cursor:
  print(x)

('Atlassian', 'Sydney', 'Other', Decimal('500.00'), Decimal('0.05'), '3/6/2023', 'Post-IPO', 'Australia', Decimal('210.00'), 1)
('DocuSign', 'SF Bay Area', 'Sales', Decimal('680.00'), Decimal('0.10'), '2/16/2023', 'Post-IPO', 'United States', Decimal('536.00'), 2)
('Pico Interactive', 'SF Bay Area', 'Other', Decimal('400.00'), Decimal('0.20'), '2/16/2023', 'Acquired', 'United States', Decimal('62.00'), 3)
('SiriusXM', 'New York City', 'Media', Decimal('475.00'), Decimal('0.08'), '3/6/2023', 'Post-IPO', 'United States', Decimal('525.00'), 4)
('Alerzo', 'Ibadan', 'Retail', Decimal('400.00'), Decimal('0.27'), '3/6/2023', 'Series B', 'Nigeria', Decimal('16.00'), 5)
('The RealReal', 'SF Bay Area', 'Retail', Decimal('230.00'), Decimal('0.07'), '2/16/2023', 'Post-IPO', 'United States', Decimal('356.00'), 6)
('UpGrad', 'Mumbai', 'Education', Decimal('120.00'), Decimal('0.36'), '3/6/2023', 'Unknown', 'India', Decimal('631.00'), 7)
('Smartsheet', 'Seattle', 'Other', Decimal('85.00'), Decimal('0.

In [77]:
#Missing values
query='''
SELECT Count(*)
FROM imdb.layoffs
WHERE date IS NULL OR date='' OR date='NULL';
'''
cursor.execute(query)
for x in cursor:
  print(x)
print("date has",x[0],"missing values.")

(1,)
date has 1 missing values.


In [78]:
# Convert date strings to ISO format (YYYY-MM-DD) and map 'NULL' and '' to NULL
# so that the column can be changed to the DATE type afterwards
query = '''
UPDATE imdb.layoffs
SET date = CASE
    WHEN date LIKE '%/%/%' THEN STR_TO_DATE(date, '%m/%d/%Y')
    WHEN date LIKE '%-%-____' THEN STR_TO_DATE(date, '%d-%m-%Y')  -- DD-MM-YYYY only
    WHEN date LIKE '%/%-%' THEN STR_TO_DATE(date, '%m/%d-%Y')
    WHEN date = 'NULL' OR date = '' THEN NULL  -- Handle 'NULL' and empty strings
    ELSE date  -- Keep existing format if it's already YYYY-MM-DD
END;
'''
cursor.execute(query)
conn.commit()

In [79]:
# Now, change the column data type to DATE
query = '''
ALTER TABLE imdb.layoffs
MODIFY COLUMN date DATE;
'''
cursor.execute(query)
conn.commit()

In [80]:
#Checking updated date datatype
query='''
SHOW COLUMNS FROM imdb.layoffs LIKE 'date';
'''
cursor.execute(query)
for x in cursor:
  print(x)

('date', 'date', 'YES', '', None, '')


In [81]:
#Missing values in date
query='''
SELECT Count(*)
FROM imdb.layoffs
WHERE date IS NULL;
'''
cursor.execute(query)
for x in cursor:
  print(x)
print("date has",x[0],"missing values.")


(1,)
date has 1 missing values.


In [82]:
#View the record with a missing date
query='''
SELECT *
FROM imdb.layoffs
WHERE date IS NULL;
'''
cursor.execute(query)
for x in cursor:
  print(x)

('Blackbaud', 'Charleston', 'Other', Decimal('500.00'), Decimal('0.14'), None, 'Post-IPO', 'United States', None, 2131)


- We could fill this value with the current date or another appropriate date.
- At this stage, we will leave it as NULL.


**Checking stage**

In [83]:
#Checking data types for stage
query='''
SHOW COLUMNS FROM imdb.layoffs LIKE 'stage';
'''
cursor.execute(query)
for x in cursor:
  print(x)

('stage', 'varchar(30)', 'YES', '', None, '')


In [84]:
#Checking distinct stage values
query='''
SELECT DISTINCT stage
FROM imdb.layoffs;
'''
cursor.execute(query)
for x in cursor:
  print(x)

('Post-IPO',)
('Acquired',)
('Series B',)
('Unknown',)
('Series E',)
('Series G',)
('Series D',)
('Series F',)
('Series C',)
('Series A',)
('Subsidiary',)
('Series H',)
('Private Equity',)
('Seed',)
('Series J',)
('NULL',)
('Series I',)


In [85]:
#Checking missing values
query='''
SELECT Count(*)
FROM imdb.layoffs
WHERE stage IS NULL OR stage='Unknown' OR stage='NULL';
'''
cursor.execute(query)
for x in cursor:
  print(x)
print("stage has",x[0],"missing values.")

(398,)
stage has 398 missing values.


In [86]:
#View the missing values
query='''
SELECT *
FROM imdb.layoffs
WHERE stage IS NULL OR stage='Unknown' OR stage='NULL';
'''
cursor.execute(query)
for x in cursor:
  print(x)

('UpGrad', 'Mumbai', 'Education', Decimal('120.00'), Decimal('0.36'), datetime.date(2023, 3, 6), 'Unknown', 'India', Decimal('631.00'), 7)
('Loft', 'Sao Paulo', 'Real Estate', Decimal('340.00'), Decimal('0.15'), datetime.date(2023, 3, 3), 'Unknown', 'Brazil', Decimal('788.00'), 9)
('Lendi', 'Sydney', 'Real Estate', Decimal('100.00'), Decimal('0.32'), datetime.date(2023, 3, 3), 'Unknown', 'Australia', Decimal('59.00'), 14)
('DUX Education', 'Bengaluru', 'Education', Decimal('199.00'), Decimal('1.00'), datetime.date(2023, 2, 28), 'Unknown', 'India', None, 52)
('Amount', 'Chicago', 'Finance', Decimal('130.00'), Decimal('0.25'), datetime.date(2023, 2, 27), 'Unknown', 'United States', Decimal('283.00'), 58)
('EMX Digital', 'New York City', 'Marketing', Decimal('100.00'), Decimal('1.00'), datetime.date(2023, 2, 13), 'Unknown', 'United States', None, 61)
('BitSight', 'Tel Aviv', 'Security', Decimal('40.00'), Decimal('0.16'), datetime.date(2023, 2, 26), 'Unknown', 'Israel', Decimal('401.00'), 

- We impute these values with the mode, i.e. the most frequent stage within each industry.
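
The idea of mode imputation per industry can be sketched in plain Python before writing the SQL. This is an illustration only: the rows are made up, `MISSING` and `mode_by_industry` are assumed names, and ties between equally frequent stages are resolved by whichever stage was seen first.

```python
from collections import Counter, defaultdict

# Markers treated as missing (assumption, mirroring the SQL checks above).
MISSING = {None, "", "NULL", "Unknown"}

# Illustrative (industry, stage) pairs, not the real dataset.
rows = [
    ("Finance", "Series B"),
    ("Finance", "Series B"),
    ("Finance", "Post-IPO"),
    ("Finance", "Unknown"),
    ("Retail", "Post-IPO"),
    ("Retail", None),
    ("Retail", "NULL"),
]

# Count only the known stages within each industry.
counts = defaultdict(Counter)
for industry, stage in rows:
    if stage not in MISSING:
        counts[industry][stage] += 1

# The mode is the most frequent known stage per industry.
mode_by_industry = {ind: c.most_common(1)[0][0] for ind, c in counts.items()}

# Replace missing stages with the mode of the same industry.
imputed = [
    (industry, stage if stage not in MISSING else mode_by_industry.get(industry))
    for industry, stage in rows
]
print(mode_by_industry)
print(imputed)
```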

In [433]:
#Count stage values per industry; the OR condition keeps every non-NULL stage, so 'Unknown' still appears in the output
query = """
 SELECT industry,stage,Count(*) AS stage_count
 FROM imdb.layoffs
 WHERE stage IS NOT NULL or stage !='Unknown'
 GROUP BY industry,stage
 ORDER BY stage_count desc ;
"""
cursor.execute(query)
for x in cursor:
  print(x)

('Finance', 'Unknown', 54)
('Healthcare', 'Post-IPO', 51)
('Finance', 'Series B', 50)
('Transportation', 'Post-IPO', 45)
('Finance', 'Post-IPO', 43)
('Finance', 'Series C', 42)
('Other', 'Post-IPO', 41)
('Retail', 'Post-IPO', 39)
('Crypto', 'Unknown', 39)
('Retail', 'Unknown', 36)
('Finance', 'Series D', 31)
('Consumer', 'Post-IPO', 31)
('Healthcare', 'Series C', 31)
('Marketing', 'Series C', 31)
('Consumer', 'Unknown', 29)
('Real Estate', 'Unknown', 28)
('Food', 'Unknown', 24)
('Marketing', 'Unknown', 23)
('Healthcare', 'Series B', 22)
('Transportation', 'Unknown', 22)
('Real Estate', 'Series C', 22)
('Retail', 'Series A', 22)
('Marketing', 'Post-IPO', 21)
('Food', 'Series C', 20)
('Healthcare', 'Unknown', 20)
('Healthcare', 'Series D', 20)
('Media', 'Post-IPO', 19)
('Real Estate', 'Series B', 19)
('Retail', 'Series C', 19)
('Media', 'Unknown', 19)
('Finance', 'Series A', 19)
('Transportation', 'Series D', 19)
('Retail', 'Series B', 18)
('Consumer', 'Acquired', 18)
('Security', 'Post-

In [8]:
#Count the known stages within each industry and rank them by frequency
query='''WITH stage_counts AS (
  SELECT
    industry,
    stage,
    COUNT(*) AS stage_count
  FROM imdb.layoffs
  WHERE stage IS NOT NULL AND stage != 'Unknown'
  GROUP BY industry, stage
),
ranked_stages AS (
  SELECT *,
         ROW_NUMBER() OVER (
           PARTITION BY industry
           ORDER BY stage_count DESC
         ) AS rn
  FROM stage_counts
)
SELECT
  industry,
  stage,
  stage_count
FROM ranked_stages
WHERE stage IS NOT NULL AND stage != 'Unknown'
OR rn=1
;
'''
cursor.execute(query)
for x in cursor:
  print(x)


('Aerospace', 'Post-IPO', 1)
('Construction', 'Series B', 7)
('Construction', 'Series C', 1)
('Construction', 'Acquired', 1)
('Construction', 'Seed', 1)
('Construction', 'Post-IPO', 1)
('Construction', 'Series E', 1)
('Construction', 'Series D', 1)
('Construction', 'Series A', 1)
('Consumer', 'Post-IPO', 31)
('Consumer', 'Acquired', 18)
('Consumer', 'Series C', 9)
('Consumer', 'Series B', 8)
('Consumer', 'Series E', 6)
('Consumer', 'Series D', 6)
('Consumer', 'Series A', 3)
('Consumer', 'Seed', 2)
('Consumer', 'Series F', 2)
('Consumer', 'Series H', 1)
('Consumer', 'Series G', 1)
('Consumer', 'Series I', 1)
('Crypto', 'Series A', 14)
('Crypto', 'Series B', 14)
('Crypto', 'Series C', 10)
('Crypto', 'Series D', 7)
('Crypto', 'Post-IPO', 7)
('Crypto', 'Seed', 4)
('Crypto', 'Series E', 3)
('Crypto', 'Acquired', 2)
('Crypto', 'Private Equity', 1)
('Crypto', 'Series F', 1)
('Data', 'Post-IPO', 14)
('Data', 'Series C', 12)
('Data', 'Series B', 11)
('Data', 'Series E', 8)
('Data', 'Series D', 

In [13]:
#Replace NULL, 'NULL' and 'Unknown' stages with the most frequent stage in the same industry
query='''
UPDATE imdb.layoffs
SET stage = (
  SELECT stage
  FROM (
    SELECT industry, stage, COUNT(*) AS stage_count
    FROM imdb.layoffs
    WHERE stage IS NOT NULL AND stage != 'Unknown' AND stage != 'NULL'
    GROUP BY industry, stage
  ) AS industry_stages
  WHERE industry_stages.industry = imdb.layoffs.industry
  ORDER BY stage_count DESC
  LIMIT 1
)
WHERE stage IS NULL OR stage = 'Unknown' OR stage='NULL';
'''

cursor.execute(query)
conn.commit()

In [14]:
#Check that no 'NULL' or 'Unknown' stages remain
query='''
SELECT stage FROM imdb.layoffs
where stage ='NULL' OR stage='Unknown';
'''
cursor.execute(query)
for x in cursor:
  print(x)


In [16]:
#View the distinct stage values after the update
query='''
SELECT distinct stage FROM imdb.layoffs;
'''
cursor.execute(query)
for x in cursor:
  print(x)

('Post-IPO',)
('Acquired',)
('Series B',)
('Series A',)
('Series C',)
('Series E',)
('Series G',)
('Series D',)
('Series F',)
('Subsidiary',)
('Series H',)
('Private Equity',)
('Seed',)
('Series J',)
('Series I',)


**Checking Country**

In [17]:
#View data type of country
query='''
SHOW COLUMNS FROM imdb.layoffs LIKE 'country';
'''
cursor.execute(query)
for x in cursor:
  print(x)

('country', 'varchar(255)', 'YES', '', None, '')


In [18]:
#View distinct values of country
query='''
SELECT DISTINCT country
FROM imdb.layoffs;
'''
cursor.execute(query)
for x in cursor:
  print(x)

('Australia',)
('United States',)
('Nigeria',)
('India',)
('Brazil',)
('Israel',)
('United States.',)
('France',)
('Germany',)
('Sweden',)
('Italy',)
('Singapore',)
('United Kingdom',)
('Indonesia',)
('Estonia',)
('Canada',)
('Ireland',)
('Japan',)
('South Korea',)
('China',)
('Finland',)
('Netherlands',)
('Spain',)
('Argentina',)
('Mexico',)
('Portugal',)
('Switzerland',)
('Egypt',)
('Chile',)
('Kenya',)
('Greece',)
('Poland',)
('Luxembourg',)
('Belgium',)
('Seychelles',)
('Norway',)
('Denmark',)
('Hong Kong',)
('New Zealand',)
('Malaysia',)
('Ghana',)
('Hungary',)
('Vietnam',)
('Austria',)
('Thailand',)
('Lithuania',)
('Senegal',)
('Pakistan',)
('United Arab Emirates',)
('Colombia',)
('Peru',)
('Bahrain',)
('Romania',)
('Turkey',)
('Russia',)
('Bulgaria',)
('South Africa',)
('Czech Republic',)
('Myanmar',)
('Uruguay',)


In [19]:
#VIEW ANY MISSING VALUES
query='''
SELECT Count(*)
FROM imdb.layoffs
WHERE country IS NULL OR country='' OR country='NULL';
'''
cursor.execute(query)
for x in cursor:
  print(x)
print("country has",x[0],"missing values.")


(0,)
country has 0 missing values.


**Observations**
- There are no missing values in the country field. However, the distinct values include 'United States.' with a trailing period, which should be standardized to 'United States'.
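
The distinct country values printed above include 'United States.' with a trailing period next to 'United States'. One possible cleanup is sketched below, using an in-memory `sqlite3` table as a stand-in and a few illustrative rows. In SQLite the two-argument `RTRIM` strips the period; in MySQL the equivalent expression would be `TRIM(TRAILING '.' FROM country)`.

```python
import sqlite3

# Stand-in data (assumption): three illustrative rows in an in-memory table.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE layoffs (company TEXT, country TEXT)")
cur.executemany(
    "INSERT INTO layoffs VALUES (?, ?)",
    [
        ("Indigo", "United States."),
        ("Airbnb", "United States"),
        ("UpGrad", "India"),
    ],
)

# Strip trailing periods only from the rows that actually end with one.
cur.execute(
    "UPDATE layoffs SET country = RTRIM(country, '.') WHERE country LIKE '%.'"
)
conn.commit()

countries = [
    row[0]
    for row in cur.execute("SELECT DISTINCT country FROM layoffs ORDER BY country")
]
print(countries)
```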


----
### **Conclusions**
- Data cleaning is a fundamental process in preparing raw data for reliable analysis, and MySQL provides a robust set of tools to carry out this task efficiently. Throughout this cleaning process, several key strategies and SQL techniques were applied to ensure data quality, consistency, and readiness for downstream analytics.

- Using IS NULL, UPDATE, COALESCE(), and window functions, we were able to:

  - Identify and handle missing values through imputation using fixed values, group-wise averages, or the most frequent category (mode).

  - Eliminate duplicate records using ROW_NUMBER() and partitioning logic.

  - Standardize text fields and categorical values to enforce consistency across entries (e.g., aligning variations like "Tech" vs. "technology").

  - Normalize or standardize numeric data where applicable for downstream machine learning or statistical modeling tasks.

  - Correct formatting issues in dates and handle null dates, for example by leaving them as NULL or by assigning an appropriate default date.

- Additionally, window functions and CTEs (Common Table Expressions) introduced in MySQL 8.0 allowed for more sophisticated data profiling and cleaning workflows, including group-based imputations and ordered operations.

- By implementing these practices, we ensured that the dataset:

  - Is free from inconsistencies and invalid entries,

  - Preserves meaningful patterns for analysis,

  - Maintains data integrity for decision-making or modeling.

Thus, effective data cleaning in MySQL not only improves data quality but also maximizes the value of the data pipeline by delivering trustworthy, analysis-ready datasets.