# Tech Layoff Analysis
We will clean and analyze a tech layoff dataset. For context, the data covers the mass layoffs in the tech industry that started in 2022. The analysis looks at which companies laid off the most employees, which areas of tech were affected, and more.

Since we are using plotly, the graphs will be interactive. If you wish to run the scripts shown here, a link to the dataset is in the project description, which you will see when you first enter this folder.

The first part of this analysis is an early overview and cleaning of the data. A quick look at the data in Excel suggests that a decent amount of cleaning is needed before any analysis can take place.

## Cleaning and Overview

In [1]:
import pandas as pd
import plotly.express as px

#Read in the data set and store it in a variable
df = pd.read_csv('tech_layoffs.csv')

#Get an overview of the data
df.head()

#Converting date strings to datetime and adding more useful breakdowns
df['Date'] = pd.to_datetime(df['reported_date'])
#'Date' is already datetime, so the year can be formatted directly (kept as a string)
df['Year'] = df['Date'].dt.strftime('%Y')
df['Month'] = df['Date'].dt.month

#Looking for empty values
df.isnull().sum()

company                            0
total_layoffs                      0
impacted_workforce_percentage      0
reported_date                      0
industry                           0
headquarter_location               0
sources                            0
status                             0
additional_notes                 467
Date                               0
Year                               0
Month                              0
dtype: int64

As noted earlier, some columns are unneeded, namely 'additional_notes' and 'sources'. This was quite clear from just opening the file in Excel. They give extra context for some data points but add nothing helpful for this analysis, so we'll get rid of them.

In [2]:
#Removing the 'additional_notes' data
df.drop('additional_notes',inplace=True,axis=1)
#Removing the 'sources' data
df.drop('sources',inplace=True,axis=1)

Another important check is whether the columns that should hold numbers have strings in them. If they do, this will get in the way of making visualizations.

In [3]:
#importing function that checks for numeric data types
from pandas.api.types import is_numeric_dtype

#loop scans through each column, printing the columns that aren't only filled numerically
cols = df.columns
for i in cols:
    if not is_numeric_dtype(df[i]):
        print(i)

company
total_layoffs
impacted_workforce_percentage
reported_date
industry
headquarter_location
status
Date
Year


Unfortunately, the two columns that are supposed to have numeric data types, 'total_layoffs' and 'impacted_workforce_percentage', include strings as well. These will have to be cleaned up before any analysis can begin.
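Before converting anything, it can help to see exactly which strings are in the way. The following is a minimal sketch on a made-up sample rather than the real column; the placeholder value 'Unclear' is an assumption for illustration and is not taken from the dataset.

```python
import pandas as pd

# Toy stand-in for the 'total_layoffs' column; these values are invented for illustration
sample = pd.Series(['120', 'Unclear', '45', 'Unclear', '3000'], name='total_layoffs')

# Try to parse every entry as a number; anything that cannot be parsed becomes NaN
as_numbers = pd.to_numeric(sample, errors='coerce')

# Keep only the original strings that failed to parse and count how often each appears
offenders = sample[as_numbers.isna()].value_counts()
print(offenders)
```

Running the same three lines on `df['total_layoffs']` would list the non-numeric entries in the real data, which makes it easier to decide how to handle them.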

In [4]:
#changes all non-numeric values in the column to zero, converts the rest to int
df['total_layoffs'] = df['total_layoffs'].apply(lambda x: 0 if not x.isnumeric() else int(x))

While perhaps not the best way of dealing with missing numbers, changing the non-numeric values to zeroes means we do not have to delete the rest of the associated row.
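One trade-off of zero-filling is that the zeroes count as real observations in averages. As a rough sketch of an alternative (not the approach used in this notebook), `pd.to_numeric` with `errors='coerce'` turns unparseable entries into NaN, which pandas skips by default in `sum` and `mean`. The toy data below is invented for illustration.

```python
import pandas as pd

# Invented example rows; 'Unclear' stands in for any non-numeric entry
toy = pd.DataFrame({
    'company': ['A', 'B', 'C', 'D'],
    'total_layoffs': ['100', 'Unclear', '300', 'Unclear'],
})

# Approach used in this notebook: non-numeric entries become 0
zero_filled = toy['total_layoffs'].apply(lambda x: int(x) if x.isnumeric() else 0)

# Alternative: non-numeric entries become NaN and are ignored by sum() and mean()
nan_filled = pd.to_numeric(toy['total_layoffs'], errors='coerce')

# The sums agree, but the means differ because the zeroes drag the average down
print(zero_filled.sum(), nan_filled.sum())
print(zero_filled.mean(), nan_filled.mean())
```

For the bar charts of totals in this notebook the two approaches give the same sums, so the choice mostly matters if averages are added later.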

Aside from numeric data, there is also an issue with the 'industry' column. The data contained in it is not uniform and could be annoying to work with in its current format.

The industry data at times lists multiple areas, and it uses slightly different names for the same tech division, such as ecommerce and e-commerce. A workable approach is to limit each sector name to its last word, as that seems to hold the most information, and then to get rid of any hyphens and capitalization.

In [5]:
#create empty list
index = []
#loop to scan through industry column
for i in df['industry'].to_list():
    #splits each word in 'industry' into a list, then stores the last word in index
    index.append(i.split(' ')[-1])

#changes all to lower case for uniformity    
index = [x.lower() for x in index]
#removes hyphens
index = [x.replace("-", "") for x in index]

#stores the new industry names back in the original column, replacing old data
df['industry'] = index

Now the cleaning is done and the analysis can begin.

## Layoffs By Company

In [6]:
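#sort by layoffs and keep the top ten companies
#.copy() gives df2 its own data, so later column assignments don't trigger a SettingWithCopyWarning
df2 = df.sort_values('total_layoffs',ascending=False).head(10).copy()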

#graphing code
px.bar(df2,x='company',y='total_layoffs',
    title='Layoffs By Company',
    labels={
        'company':'Company',
        'total_layoffs':'Layoffs'
    })

Most of the companies in the above graph are well-known giants in the tech industry, particularly the first four, so it is no surprise that the layoff numbers scale with their size. Twitter, which recently went through quite the transition upon the arrival of Elon Musk at the helm, is surprisingly low despite all that was said on the news. However, in terms of scale, it is nowhere near as big as the giants at the front of this list. Another statistic to look at for further context is the percentage of the workforce that was laid off. This helps scale the data accordingly.

In [7]:
#changes all non-numeric values in the column to zero, converts the rest to int
df2['impacted_workforce_percentage'] = df2['impacted_workforce_percentage'].apply(lambda x: 0 if not x.isnumeric() else int(x))

#only five of the companies had actual data for layoff percentage
df3 = df2.sort_values('impacted_workforce_percentage',ascending=False).head(5)

#graphing code
px.bar(df3,x='company',y='impacted_workforce_percentage',
    title = 'Layoff Scale By Company',
    labels = {
        'company':'Company',
        'impacted_workforce_percentage':'Layoffs (%)'
    })

Due to a lack of data, some of the above companies had to be dropped from the scale analysis. We now see that while Twitter only let go of 3740 employees, it had by far the greatest loss in workforce percentage, coming in at 70%. Meanwhile, Amazon, which laid off 18000 employees, only lost 5% of its original workforce, a far less drastic change than at Twitter.
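To make the difference in scale concrete, the headcount before the layoff can be estimated by dividing the layoff count by the percentage. The sketch below uses only the two figures quoted above; since the percentages are presumably rounded, the results should be read as rough estimates.

```python
# (layoff count, impacted workforce percentage) as quoted in the text above
layoffs = {
    'Twitter': (3740, 70),
    'Amazon': (18000, 5),
}

# Implied headcount before the layoff: count / (percentage / 100)
implied_headcount = {
    name: round(count * 100 / pct)
    for name, (count, pct) in layoffs.items()
}

for name, headcount in implied_headcount.items():
    print(f"{name}: roughly {headcount:,} employees before the layoff")
```

By this estimate Amazon's workforce was several dozen times the size of Twitter's, which is why a much larger layoff count still amounts to a much smaller percentage.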

## Tech Divisions

In [19]:
#summed the layoff numbers within each division
div = df.groupby('industry')[['total_layoffs']].sum()
#ordered by layoff number and kept the top 10
div = div.sort_values('total_layoffs',ascending=False).head(10)

#graphing code
px.bar(div,x=div.index, y=div.total_layoffs,
    title = 'Layoffs Within Each Tech Division',
    labels = {
        'industry':'Tech Division',
        'total_layoffs':'Layoffs'
    })

While some of the above divisions could perhaps be merged, saas with services or software for example, this is the situation that the data presents to us. The divisions with the greatest impact are dominated by the same top-ranking companies from earlier: Amazon, Alphabet, Meta, and Microsoft. Overhiring, perhaps just for backup in some cases, seems to be the main issue at these companies. Thanks to their high earnings and budgets, it is not a problem until a declining market hits, and then the layoffs begin. The main takeaway would be that while many non-crucial and senior roles are not secure in general, the bigger your employer is, the more likely you are to be laid off without an issue when it is 'struggling' financially.
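If one did want to merge related divisions, a small mapping applied before the groupby would do it. The sketch below uses invented numbers, and the mapping itself (`merge_map`, folding saas and services into software) is a hypothetical choice for illustration, not something the dataset prescribes.

```python
import pandas as pd

# Invented rows for illustration only
toy = pd.DataFrame({
    'industry': ['saas', 'software', 'services', 'ecommerce', 'saas'],
    'total_layoffs': [100, 250, 50, 400, 150],
})

# Hypothetical mapping: which division names to fold into which
merge_map = {'saas': 'software', 'services': 'software'}

# Names not in the mapping are left unchanged by replace()
toy['division'] = toy['industry'].replace(merge_map)

# Sum layoffs within each merged division, largest first
merged = toy.groupby('division')['total_layoffs'].sum().sort_values(ascending=False)
print(merged)
```

Any such merge changes the ranking, so the mapping should be stated alongside the chart if it is used.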

## Layoffs by Sector

In [30]:
#graphing code
fig = px.pie(df,values='total_layoffs', names='status',
    title = 'Layoffs by Sector Type',
    labels = {
        'total_layoffs':'Layoffs',
        'status':'Sector'
    })

#fixing up graph size and adjusting title position
fig.update_layout(margin=dict(t=0, b=0, l=0, r=0),title_y=0.5,title_x=0.95)
fig.show()

Since the companies with the greatest layoff numbers are all public, this is not a surprise. Pressure from investors is also a much bigger concern for public companies, and so they must generate revenue wherever possible, often at the expense of their employees.

## Layoff Trendline

In [31]:
#summing the layoffs by day
by_day = df.groupby('Date')[['total_layoffs']].sum()

#graphing code
px.line(by_day,x=by_day.index,y=by_day.total_layoffs,
    title = 'Layoff Timeline',
    labels = {
        'total_layoffs':'Layoffs'
    })

A worsening market, along with companies settling revenue and budget numbers at the start of the year, is probably the main factor behind these recent heavy spikes. It would be interesting to see how the overall numbers for 2022 and 2023 compare, even though only one month of 2023 has passed.
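A daily series is naturally spiky, since most days have no reported layoffs at all. One way to make the trend easier to read is to total the layoffs per calendar month with `resample`. This is a sketch on invented dates and counts, assuming a datetime 'Date' column like the one created during cleaning.

```python
import pandas as pd

# Invented dates and counts for illustration only
toy = pd.DataFrame({
    'Date': pd.to_datetime(['2022-11-03', '2022-11-17', '2022-12-05',
                            '2023-01-04', '2023-01-20']),
    'total_layoffs': [100, 200, 50, 400, 600],
})

# 'MS' groups rows by calendar month, labelled with the first day of that month
by_month = toy.set_index('Date')['total_layoffs'].resample('MS').sum()
print(by_month)
```

The resulting series can be passed to `px.line` or `px.bar` in the same way as `by_day`.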

In [32]:
#summing layoffs by year
by_year = df.groupby('Year')[['total_layoffs']].sum()

#graphing code
px.bar(by_year,x=by_year.index, y=by_year.total_layoffs,
    title='2022 vs 2023 Layoff Numbers',
    labels={
        'total_layoffs':'Layoffs'
    })
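Because the yearly totals compare a full year against a partial one, a per-month average gives a fairer side-by-side view. The sketch below uses invented numbers and divides each year's total by the number of distinct months that have any reported layoffs, which is an assumption about how to normalize rather than the only option.

```python
import pandas as pd

# Invented dates and counts for illustration only
toy = pd.DataFrame({
    'Date': pd.to_datetime(['2022-06-10', '2022-11-03', '2022-12-05',
                            '2023-01-04', '2023-01-20']),
    'total_layoffs': [100, 200, 300, 250, 250],
})
toy['Year'] = toy['Date'].dt.year
toy['Month'] = toy['Date'].dt.month

# Total layoffs and number of distinct months with data, per year
per_year = toy.groupby('Year').agg(
    total=('total_layoffs', 'sum'),
    months_with_data=('Month', 'nunique'),
)

# Average layoffs per month that has at least one report
per_year['avg_per_month'] = per_year['total'] / per_year['months_with_data']
print(per_year)
```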

It is worrying that the 2022 layoffs have already almost been matched in 2023, and everything currently points to more layoffs occurring. For an aspiring data analyst this is perhaps not the best motivation, but what can you do other than keep trying?