<h1>Startup Transformation</h1>

## Introduction
In this project, we analyze data from a tech startup that is looking to improve its operations after a global pandemic has taken the world by storm.<br>

We will apply data transformation techniques to make better sense of the company’s data and also help answer important questions such as:<br>

Is the company in good financial health?<br>
Does the company need to let go of any employees?<br>
Should the company allow employees to work from home permanently?<br>
<hr>

First, we import the necessary libraries.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler

We load the datasets, create data frames and clean the data before analyzing.

In [None]:
# Load datasets
financial_data = pd.read_csv('/kaggle/input/financial-data/financial_data.csv')
expense_overview = pd.read_csv('/kaggle/input/expenses/expenses.csv')
employees = pd.read_csv('/kaggle/input/employees/employees.csv')

In [None]:
print(financial_data)

In [None]:
print(expense_overview)

In [None]:
print(employees.head())

In [None]:
print(employees.info())

In [None]:
print(expense_overview.info())

In [None]:
print(financial_data.info())

In [None]:
print(employees.duplicated().sum())
print(expense_overview.duplicated().sum())
print(financial_data.duplicated().sum())

In [None]:
# Remove duplicate values
employees = employees.drop_duplicates()
print(employees.duplicated().sum())

Now, the data is ready for analysis.
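
One detail of the deduplication step is worth keeping in mind: `drop_duplicates` keeps the original row labels, so the index can have gaps afterwards. The sketch below uses a small made-up frame (the names and values are illustrative, not taken from the company's data) to show the gap and how `reset_index(drop=True)` restores a contiguous index. This matters whenever the frame is later combined with another one that has a fresh 0..n-1 index.

```python
import pandas as pd

# Hypothetical toy data with one exact duplicate row (labels 1 and 2)
toy = pd.DataFrame({
    'Name': ['Ana', 'Ben', 'Ben', 'Cy'],
    'Salary': [50000, 62000, 62000, 58000],
})

deduped = toy.drop_duplicates()
print(deduped.index.tolist())        # label 2 is gone, so the index has a gap

# Renumber the rows 0..n-1 and discard the old labels
deduped_reset = deduped.reset_index(drop=True)
print(deduped_reset.index.tolist())
```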

In [None]:
print('The average salary is', round(employees['Salary'].mean(), 2), 'dollars.')
print('The average productivity is', round(employees['Productivity'].mean(), 2), 'percent.')
print('The average commute time is', round(employees['Commute Time'].mean(), 2), 'minutes.')

In [None]:
month = financial_data['Month']
revenue = financial_data['Revenue']
expenses = financial_data['Expenses']

In [None]:
plt.plot(month, revenue)
plt.xlabel('Month')
plt.ylabel('Amount ($)')
plt.title('Revenue')
plt.show()

In [None]:
plt.clf()
plt.plot(month, expenses)
plt.xlabel('Month')
plt.ylabel('Amount ($)')
plt.title('Expenses')
plt.show()

As shown, revenue seems to be quickly decreasing while expenses are increasing. If the current trend continues, expenses will soon surpass revenues, putting the company at risk. Let's explore the data to determine which category constitutes the company's main cost.
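
To make "soon" concrete, one option is to fit a straight line to each series and solve for the month where the two lines cross. The sketch below does this with made-up monthly figures (not the company's data) and assumes both trends stay linear, which is a strong simplification.

```python
import numpy as np

# Hypothetical monthly figures: revenue falling, expenses rising
months = np.arange(1, 7)  # months 1..6
revenue_demo = np.array([100, 95, 90, 85, 80, 75], dtype=float) * 1000
expenses_demo = np.array([50, 53, 56, 59, 62, 65], dtype=float) * 1000

# Fit straight lines of the form: value = slope * month + intercept
rev_slope, rev_intercept = np.polyfit(months, revenue_demo, 1)
exp_slope, exp_intercept = np.polyfit(months, expenses_demo, 1)

# The lines cross where rev_slope*m + rev_intercept == exp_slope*m + exp_intercept
crossover_month = (exp_intercept - rev_intercept) / (rev_slope - exp_slope)
print(f'Expenses overtake revenue around month {crossover_month:.2f}')
```

With the real data, `months` would need to be numeric (for example, a month number) rather than month names.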

In [None]:
expense_categories = expense_overview['Expense']
proportions = expense_overview['Proportion']

In [None]:
plt.clf()
plt.pie(proportions, labels = expense_categories)
plt.axis('equal')
plt.tight_layout()
plt.show()

We simplify the pie chart by collapsing all categories that make up less than 5% of overall expenses into a single "Other" slice. This helps the management team see the big picture of the company’s expenses without getting distracted by noisy data.
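
The collapsing step can also be done programmatically instead of typing the values by hand. The following is a minimal sketch under stated assumptions: the column names `Expense` and `Proportion` match the ones used in this notebook, but the individual small categories and their proportions are invented for illustration.

```python
import pandas as pd

# Hypothetical expense breakdown; the small categories are illustrative
overview = pd.DataFrame({
    'Expense': ['Salaries', 'Advertising', 'Office Rent',
                'Equipment', 'Utilities', 'Travel'],
    'Proportion': [0.62, 0.15, 0.15, 0.04, 0.03, 0.01],
})

threshold = 0.05
is_small = overview['Proportion'] < threshold

# Keep the large categories and append one 'Other' row with the sum of the small ones
other_row = pd.DataFrame({
    'Expense': ['Other'],
    'Proportion': [overview.loc[is_small, 'Proportion'].sum()],
})
collapsed = pd.concat([overview[~is_small], other_row], ignore_index=True)
print(collapsed)
```

The resulting `collapsed['Proportion']` and `collapsed['Expense']` columns can be passed straight to `plt.pie`.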

In [None]:
expense_categories = ['Salaries', 'Advertising', 'Office Rent', 'Other']
proportions = [0.62, 0.15, 0.15, 0.08]
plt.clf()
plt.pie(proportions, labels = expense_categories)
plt.title('Expense Categories')
plt.axis('equal')
plt.tight_layout()
plt.show()

Salaries make up 62% of expenses. Therefore, to cut costs in a meaningful way, we can recommend that management let go of some employees.

Each employee at the company is assigned a productivity score based on their work. We explore the relationship between Salary and Productivity more in depth. These two features are on vastly different scales, so we will standardize the data.
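
Standardizing a column means subtracting its mean and dividing by its standard deviation, so every feature ends up with mean 0 and standard deviation 1. The sketch below shows the calculation by hand on made-up values; scikit-learn's `StandardScaler` computes the same quantity using the population standard deviation (`ddof=0`). The helper name `standardize` is ours, not a library function.

```python
import numpy as np

# Hypothetical values on very different scales
salary_demo = np.array([40000.0, 55000.0, 70000.0, 85000.0])
productivity_demo = np.array([60.0, 75.0, 80.0, 95.0])

def standardize(values):
    """Return z-scores using the population standard deviation (ddof=0)."""
    return (values - values.mean()) / values.std(ddof=0)

z_salary = standardize(salary_demo)
z_productivity = standardize(productivity_demo)

# Both columns are now unitless and directly comparable
print(z_salary.round(3))
print(z_productivity.round(3))
```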

In [None]:
data_to_standardize = employees[['Salary', 'Productivity']]
scaler = StandardScaler()
standardized_data = scaler.fit_transform(data_to_standardize)
# Reuse the employees index: drop_duplicates() can leave gaps in it,
# and pd.concat(axis=1) aligns rows by index label, not by position
standardized_df = pd.DataFrame(
    standardized_data,
    columns=['Standardized_Salary', 'Standardized_Productivity'],
    index=employees.index,
)
standardized_employees = pd.concat([employees, standardized_df], axis=1)

In [None]:
plt.figure(figsize=(8, 6))
sns.scatterplot(x='Standardized_Salary', y='Standardized_Productivity', data=standardized_employees)
plt.title('Productivity vs Salary (standardized)')
plt.xlabel('Standardized Salary')
plt.ylabel('Standardized Productivity')
plt.show()

As shown above, more productive employees don't necessarily have higher salaries. Therefore, the best decision would be to keep the most highly productive employees and let go of the least productive employees.

In [None]:
sorted_productivity = employees.sort_values(by=['Productivity'])
print(sorted_productivity.head(10))

In [None]:
employees_cut = sorted_productivity.head(100)
print(employees_cut)

These are the 100 employees with the lowest productivity scores, who would be let go under this recommendation.
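
As a side note, pandas offers `nsmallest` as a shortcut for "sort ascending, then take the first n rows". The sketch below uses made-up names and scores, and the cut size of 2 is purely illustrative.

```python
import pandas as pd

# Hypothetical employees; names and scores are made up for illustration
staff = pd.DataFrame({
    'Name': ['Ana', 'Ben', 'Cy', 'Dee', 'Eli'],
    'Productivity': [82.5, 41.0, 67.3, 38.9, 90.1],
})

n_cut = 2  # illustrative cut size
lowest = staff.nsmallest(n_cut, 'Productivity')
print(lowest)

# Equivalent result via an explicit sort (no tied scores in this toy data)
same = staff.sort_values(by='Productivity').head(n_cut)
```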

Next, we briefly analyze employees' commute times to see whether the company should consider allowing remote work indefinitely, which would save employees time each day.

In [None]:
commute_times = employees['Commute Time']
print(commute_times.describe())


Let’s explore the shape of the commute time data using a histogram of the log-transformed values.
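
A log transform is a common choice for right-skewed data such as commute times, where most values are small and a few are very large. The sketch below, on made-up commute times in minutes, compares the skewness before and after the transform. Note that `np.log` requires strictly positive values.

```python
import numpy as np
import pandas as pd

# Hypothetical commute times in minutes: mostly short, a few very long
commute_demo = pd.Series([5, 8, 10, 12, 15, 20, 25, 40, 90, 180], dtype=float)

raw_skew = commute_demo.skew()
log_skew = np.log(commute_demo).skew()  # the long right tail is compressed

print(f'skewness before log: {raw_skew:.2f}, after log: {log_skew:.2f}')
```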

In [None]:
# Log-transform the commute times before plotting the histogram
commute_times_log = np.log(commute_times)
plt.clf()
plt.hist(commute_times_log)
plt.title("Employee Commute Times (log-transformed)")
plt.xlabel("log(Commute Time in minutes)")
plt.ylabel("Frequency")
plt.show()

In [None]:
plt.figure(figsize=(8, 6))
sns.scatterplot(x='Commute Time', y='Productivity', data=employees)
plt.title('Productivity vs Commute Time')
plt.xlabel('Commute Time')
plt.ylabel('Productivity')
plt.show()

We can observe a very weak negative correlation between productivity and commute time. Therefore, maintaining a remote workplace could be a beneficial strategy to ensure maximum productivity for all employees.

Finally, let's look at the relationships among employees' salary, productivity, and commute time using a correlation matrix.

In [None]:
correlation_matrix = employees[['Salary', 'Productivity', 'Commute Time']].corr()
print(correlation_matrix)

The correlation between Salary and Productivity is approximately 0.018. This value is close to 0, indicating a very weak positive correlation. This means that there's almost no linear relationship between an employee's salary and their productivity.

The correlation between Salary and Commute Time is approximately 0.030. Similar to the previous case, this value is also very close to 0, indicating a very weak positive correlation. It suggests that there's almost no linear relationship between an employee's salary and their commute time.

The correlation between Productivity and Commute Time is approximately -0.061. This value is also close to 0, but negative, indicating a very weak negative correlation. Employees with longer commutes tend to have very slightly lower productivity, but the linear relationship is almost nonexistent.
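
For reference, each entry of the matrix is a Pearson correlation coefficient, which can be computed by hand from the centered values. The sketch below checks a manual calculation against pandas on made-up numbers; the values are illustrative and unrelated to the results quoted above.

```python
import numpy as np
import pandas as pd

# Hypothetical data; values are illustrative only
demo = pd.DataFrame({
    'Productivity': [55.0, 70.0, 65.0, 80.0, 90.0, 60.0],
    'Commute Time': [45.0, 30.0, 50.0, 20.0, 25.0, 35.0],
})

# Center each column on its mean
x = demo['Productivity'] - demo['Productivity'].mean()
y = demo['Commute Time'] - demo['Commute Time'].mean()

# Pearson r: sum of products of deviations, scaled by the two spreads
r_manual = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())
r_pandas = demo['Productivity'].corr(demo['Commute Time'])

print(round(r_manual, 3), round(r_pandas, 3))
```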