Risky Projects

Identify projects that are at risk for going overbudget. A project is considered to be overbudget if the cost of all employees assigned to the project is greater than the budget of the project.


You'll need to prorate the cost of the employees to the duration of the project. For example, if the budget for a project that takes half a year to complete is $10K, then the total half-year salary of all employees assigned to the project should not exceed $10K. Salary is defined on a yearly basis, so be careful how to calculate salaries for the projects that last less or more than one year.


Output a list of projects that are overbudget with their project name, project budget, and prorated total employee expense (rounded to the next dollar amount).


HINT: to make it simpler, consider that all years have 365 days. You don't need to think about the leap years.
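The prorating rule above can be sketched as a small helper. This is a minimal illustration, not part of the original notebook: the name `prorated_cost` and the constant `DAYS_PER_YEAR` are hypothetical, and the Project1 figures are taken from the tables further down.

```python
import math

DAYS_PER_YEAR = 365  # per the hint: ignore leap years


def prorated_cost(total_yearly_salary: float, duration_days: int) -> int:
    """Scale a yearly salary total to the project length, rounded up to a whole dollar."""
    return math.ceil(total_yearly_salary * duration_days / DAYS_PER_YEAR)


# Project1 in this notebook: two yearly salaries (20204 + 48079) over 194 days.
project1_cost = prorated_cost(20204 + 48079, 194)
print(project1_cost)          # 36293
print(project1_cost > 29498)  # True: above Project1's budget of 29498
```

Rounding with `math.ceil` follows the "rounded to the next dollar amount" requirement; a plain `round` would sometimes round down instead.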

In [3]:
import pandas as pd
import numpy as np
from datetime import datetime

In [5]:
linkedin_projects = pd.read_csv("../CSV/linkedin_projects.csv")
linkedin_projects = linkedin_projects.iloc[:, :5]
linkedin_projects.head()

Unnamed: 0,id,title,budget,start_date,end_date
0,1,Project1,29498,2018-08-31,2019-03-13
1,2,Project2,32487,2018-01-27,2018-12-13
2,3,Project3,43909,2019-11-05,2019-12-09
3,4,Project4,15776,2018-06-28,2018-11-20
4,5,Project5,36268,2019-03-13,2020-01-02


In [7]:
linkedin_emp_projects = pd.read_csv("../CSV/linkedin_emp_projects.csv")
columns_to_keep = ["emp_id", "project_id"]
linkedin_emp_projects = linkedin_emp_projects[columns_to_keep]
linkedin_emp_projects.head()

Unnamed: 0,emp_id,project_id
0,10592,1
1,10593,2
2,10594,3
3,10595,4
4,10596,5


In [10]:
linkedin_employees = pd.read_csv("../CSV/linkedin_employees.csv")
columns_to_drop = ["Unnamed: 4", "Unnamed: 5", "Unnamed: 6"]
linkedin_employees = linkedin_employees.drop(columns=columns_to_drop)
linkedin_employees.head()

Unnamed: 0,id,first_name,last_name,salary
0,10592,Jennifer,Roberts,20204
1,10593,Haley,Ho,33154
2,10594,Eric,Mccarthy,32360
3,10595,Gina,Martinez,46388
4,10596,Jason,Fields,12348


In [11]:
# Attach each employee assignment to its project (projects.id == emp_projects.project_id)
df = pd.merge(linkedin_projects, linkedin_emp_projects, how='inner', left_on=['id'], right_on=['project_id'])
df.head()

Unnamed: 0,id,title,budget,start_date,end_date,emp_id,project_id
0,1,Project1,29498,2018-08-31,2019-03-13,10592,1
1,1,Project1,29498,2018-08-31,2019-03-13,10642,1
2,2,Project2,32487,2018-01-27,2018-12-13,10593,2
3,2,Project2,32487,2018-01-27,2018-12-13,10643,2
4,3,Project3,43909,2019-11-05,2019-12-09,10594,3


In [12]:
# Attach employee details (including yearly salary) to every assignment row
df1 = pd.merge(df, linkedin_employees, how='inner', left_on=['emp_id'], right_on=['id'])
df1.head()

Unnamed: 0,id_x,title,budget,start_date,end_date,emp_id,project_id,id_y,first_name,last_name,salary
0,1,Project1,29498,2018-08-31,2019-03-13,10592,1,10592,Jennifer,Roberts,20204
1,1,Project1,29498,2018-08-31,2019-03-13,10642,1,10642,Joshua,Salinas,48079
2,2,Project2,32487,2018-01-27,2018-12-13,10593,2,10593,Haley,Ho,33154
3,2,Project2,32487,2018-01-27,2018-12-13,10643,2,10643,Sarah,Briggs,27150
4,3,Project3,43909,2019-11-05,2019-12-09,10594,3,10594,Eric,Mccarthy,32360


In [13]:
df1['project_duration'] = (pd.to_datetime(df1['end_date']) - pd.to_datetime(df1['start_date'])).dt.days
df1.head()

Unnamed: 0,id_x,title,budget,start_date,end_date,emp_id,project_id,id_y,first_name,last_name,salary,project_duration
0,1,Project1,29498,2018-08-31,2019-03-13,10592,1,10592,Jennifer,Roberts,20204,194
1,1,Project1,29498,2018-08-31,2019-03-13,10642,1,10642,Joshua,Salinas,48079,194
2,2,Project2,32487,2018-01-27,2018-12-13,10593,2,10593,Haley,Ho,33154,320
3,2,Project2,32487,2018-01-27,2018-12-13,10643,2,10643,Sarah,Briggs,27150,320
4,3,Project3,43909,2019-11-05,2019-12-09,10594,3,10594,Eric,Mccarthy,32360,34
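As a quick check of the duration step, the sketch below recomputes the day counts for the first three projects from their dates alone. It assumes, as the cell above does, that the duration is simply `end_date - start_date` (the end day itself is not counted); the problem statement does not say whether an inclusive count is intended.

```python
import pandas as pd

dates = pd.DataFrame({
    "title": ["Project1", "Project2", "Project3"],
    "start_date": ["2018-08-31", "2018-01-27", "2019-11-05"],
    "end_date": ["2019-03-13", "2018-12-13", "2019-12-09"],
})

# Subtracting two datetime columns gives a timedelta column; .dt.days extracts whole days.
duration = (pd.to_datetime(dates["end_date"]) - pd.to_datetime(dates["start_date"])).dt.days
print(duration.tolist())  # [194, 320, 34]
```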


In [18]:
df_expense = df1.groupby('title')['salary'].sum().reset_index(name='expense').sort_values(by='expense', ascending=True)
df_expense.head(10)

Unnamed: 0,title,expense
41,Project47,25246
8,Project17,27423
4,Project13,28526
44,Project5,29748
6,Project15,34201
35,Project41,35229
19,Project27,42104
49,Project9,46341
2,Project11,47670
48,Project8,48560


In [19]:
df_budget_expense = pd.merge(df1, df_expense, how = 'left',left_on = ['title'], right_on=['title'])
df_budget_expense.head()

Unnamed: 0,id_x,title,budget,start_date,end_date,emp_id,project_id,id_y,first_name,last_name,salary,project_duration,expense
0,1,Project1,29498,2018-08-31,2019-03-13,10592,1,10592,Jennifer,Roberts,20204,194,68283
1,1,Project1,29498,2018-08-31,2019-03-13,10642,1,10642,Joshua,Salinas,48079,194,68283
2,2,Project2,32487,2018-01-27,2018-12-13,10593,2,10593,Haley,Ho,33154,320,60304
3,2,Project2,32487,2018-01-27,2018-12-13,10643,2,10643,Sarah,Briggs,27150,320,60304
4,3,Project3,43909,2019-11-05,2019-12-09,10594,3,10594,Eric,Mccarthy,32360,34,78363


In [20]:
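# Scale the yearly salary total to the project length (365-day year) and round up to a whole dollar
df_budget_expense['prorated_expense'] = np.ceil(df_budget_expense['expense']*(df_budget_expense['project_duration'])/365)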
df_budget_expense.head()

Unnamed: 0,id_x,title,budget,start_date,end_date,emp_id,project_id,id_y,first_name,last_name,salary,project_duration,expense,prorated_expense
0,1,Project1,29498,2018-08-31,2019-03-13,10592,1,10592,Jennifer,Roberts,20204,194,68283,36293.0
1,1,Project1,29498,2018-08-31,2019-03-13,10642,1,10642,Joshua,Salinas,48079,194,68283,36293.0
2,2,Project2,32487,2018-01-27,2018-12-13,10593,2,10593,Haley,Ho,33154,320,60304,52870.0
3,2,Project2,32487,2018-01-27,2018-12-13,10643,2,10643,Sarah,Briggs,27150,320,60304,52870.0
4,3,Project3,43909,2019-11-05,2019-12-09,10594,3,10594,Eric,Mccarthy,32360,34,78363,7300.0


This code creates a new column named 'prorated_expense' in the DataFrame `df_budget_expense`. Let's break it down step by step:

1. `df_budget_expense['expense']*(df_budget_expense['project_duration'])/365`: This expression computes the prorated cost for each row. 'expense' is the total yearly salary of the employees on the project; multiplying it by 'project_duration' (in days) and dividing by 365 scales that yearly total to the length of the project.

2. `np.ceil(...)`: Applies `np.ceil()`, which rounds each value up. The task asks for the expense rounded to the next dollar amount, so any fractional amount is rounded upward.

3. `df_budget_expense['prorated_expense'] = ...`: Creates the new column 'prorated_expense' in the DataFrame `df_budget_expense` and assigns the computed prorated costs to it.

As a result, 'prorated_expense' holds the prorated cost, rounded up, for every row of `df_budget_expense`.
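One detail worth noting is that `np.ceil` returns floats, which is why the column prints as `36293.0`. The sketch below, using the 'expense' and 'project_duration' values of the first three projects from the tables above, shows one way to get whole-dollar integers instead; the frame name `sample` is only for illustration.

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    "title": ["Project1", "Project2", "Project3"],
    "expense": [68283, 60304, 78363],       # total yearly salary per project
    "project_duration": [194, 320, 34],     # days
})

# Prorate to the project length, round up, then cast the float result to int.
raw = sample["expense"] * sample["project_duration"] / 365
sample["prorated_expense"] = np.ceil(raw).astype(int)
print(sample["prorated_expense"].tolist())  # [36293, 52870, 7300]
```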

In [22]:
df_budget_expense['budget_diff'] = df_budget_expense['prorated_expense'] - df_budget_expense['budget']
df_budget_expense.head()

Unnamed: 0,id_x,title,budget,start_date,end_date,emp_id,project_id,id_y,first_name,last_name,salary,project_duration,expense,prorated_expense,budget_diff
0,1,Project1,29498,2018-08-31,2019-03-13,10592,1,10592,Jennifer,Roberts,20204,194,68283,36293.0,6795.0
1,1,Project1,29498,2018-08-31,2019-03-13,10642,1,10642,Joshua,Salinas,48079,194,68283,36293.0,6795.0
2,2,Project2,32487,2018-01-27,2018-12-13,10593,2,10593,Haley,Ho,33154,320,60304,52870.0,20383.0
3,2,Project2,32487,2018-01-27,2018-12-13,10643,2,10643,Sarah,Briggs,27150,320,60304,52870.0,20383.0
4,3,Project3,43909,2019-11-05,2019-12-09,10594,3,10594,Eric,Mccarthy,32360,34,78363,7300.0,-36609.0


In [23]:
df_over_budget = df_budget_expense[df_budget_expense["budget_diff"] > 0]
df_over_budget.head()

Unnamed: 0,id_x,title,budget,start_date,end_date,emp_id,project_id,id_y,first_name,last_name,salary,project_duration,expense,prorated_expense,budget_diff
0,1,Project1,29498,2018-08-31,2019-03-13,10592,1,10592,Jennifer,Roberts,20204,194,68283,36293.0,6795.0
1,1,Project1,29498,2018-08-31,2019-03-13,10642,1,10642,Joshua,Salinas,48079,194,68283,36293.0,6795.0
2,2,Project2,32487,2018-01-27,2018-12-13,10593,2,10593,Haley,Ho,33154,320,60304,52870.0,20383.0
3,2,Project2,32487,2018-01-27,2018-12-13,10643,2,10643,Sarah,Briggs,27150,320,60304,52870.0,20383.0
6,4,Project4,15776,2018-06-28,2018-11-20,10595,4,10595,Gina,Martinez,46388,145,77167,30656.0,14880.0


In [35]:
# .copy() gives an independent frame, so adding columns to it later is unambiguous
result = df_over_budget[['title','budget','prorated_expense']].copy()
result.head()

Unnamed: 0,title,budget,prorated_expense
0,Project1,29498,36293.0
1,Project1,29498,36293.0
2,Project2,32487,52870.0
3,Project2,32487,52870.0
6,Project4,15776,30656.0


In [45]:
# Extract the trailing number from the title so projects can be sorted numerically
result = result.assign(project_num=result['title'].str.extract(r'(\d+)$', expand=False).astype(int))
result.head()

Unnamed: 0,title,budget,prorated_expense,project_num
0,Project1,29498,36293.0,1
20,Project11,11705,31606.0,11
22,Project12,10468,62843.0,12
26,Project14,30014,36774.0,14
30,Project16,19922,21875.0,16
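Adding a column to a frame that was itself selected from another frame can raise pandas' `SettingWithCopyWarning` (in versions without copy-on-write). A common hedge is to take an explicit `.copy()` first. The sketch below uses a two-row stand-in for the over-budget frame; the names `source` and `subset` are hypothetical.

```python
import pandas as pd

source = pd.DataFrame({
    "title": ["Project1", "Project2"],
    "budget": [29498, 32487],
    "prorated_expense": [36293.0, 52870.0],
    "salary": [20204, 33154],
})

# .copy() makes the selection an independent frame, so the assignment below
# clearly targets `subset` and never touches `source`.
subset = source[["title", "budget", "prorated_expense"]].copy()
subset["project_num"] = subset["title"].str.extract(r"(\d+)$", expand=False).astype(int)

print(subset["project_num"].tolist())   # [1, 2]
print("project_num" in source.columns)  # False
```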


In [48]:
result = result.drop_duplicates().sort_values('project_num')
result.head(10)

Unnamed: 0,title,budget,prorated_expense,project_num
0,Project1,29498,36293.0,1
2,Project2,32487,52870.0,2
6,Project4,15776,30656.0,4
10,Project6,41611,63230.0,6
16,Project9,32341,44691.0,9
20,Project11,11705,31606.0,11
22,Project12,10468,62843.0,12
26,Project14,30014,36774.0,14
30,Project16,19922,21875.0,16
34,Project18,10302,46381.0,18


Solution Walkthrough
In this problem, we are given a dataset consisting of LinkedIn projects, employees, and their assignments. We need to identify projects that are at risk for going over budget, considering the cost of all employees assigned to the project and prorating their cost based on the project's duration.

We will use the pandas library to perform data manipulation and calculations. The solution involves merging multiple dataframes, calculating project duration, calculating prorated expenses, and filtering out projects that are overbudget.

Let's walk through the solution step by step.

Understanding The Data
The notebook imports pandas and NumPy and loads three dataframes from CSV files: linkedin_projects, linkedin_emp_projects, and linkedin_employees. They hold the projects (budget, start date, end date), the employee-to-project assignments, and the employee details (including yearly salary), respectively.

The Problem Statement
We need to identify projects that are at risk for going overbudget. A project is considered to be overbudget if the cost of all employees assigned to the project is greater than the budget of the project. We should prorate the cost of employees based on the duration of the project.

Breaking Down The Code
df = pd.merge(linkedin_projects, linkedin_emp_projects, how='inner', left_on=['id'], right_on=['project_id'])

This line merges the linkedin_projects and linkedin_emp_projects dataframes by matching 'id' in linkedin_projects against 'project_id' in linkedin_emp_projects. The resulting dataframe df has one row per employee assignment, with the project's details attached.
df1 = pd.merge(df, linkedin_employees, how='inner', left_on=['emp_id'], right_on=['id'])

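This line further merges the df dataframe with the linkedin_employees dataframe, matching 'emp_id' against the employee 'id'. Because both inputs have an 'id' column, pandas renames them to 'id_x' (the project id) and 'id_y' (the employee id). The resulting dataframe df1 contains project information, employee assignment information, and employee details.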
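When both inputs of a merge carry a column called 'id', pandas disambiguates them with the suffixes `_x` and `_y`. One way to avoid those suffixes is to rename the key columns before merging. The following is a sketch on a three-row extract of the data, with hypothetical frame names, not the notebook's own approach.

```python
import pandas as pd

projects = pd.DataFrame({"id": [1, 2], "title": ["Project1", "Project2"]})
assignments = pd.DataFrame({"emp_id": [10592, 10642, 10593], "project_id": [1, 1, 2]})
employees = pd.DataFrame({"id": [10592, 10642, 10593], "salary": [20204, 48079, 33154]})

# Rename the clashing 'id' columns up front so the merged frame has no id_x / id_y.
merged = (
    assignments
    .merge(projects.rename(columns={"id": "project_id"}), on="project_id", how="inner")
    .merge(employees.rename(columns={"id": "emp_id"}), on="emp_id", how="inner")
)
print(sorted(merged.columns))  # ['emp_id', 'project_id', 'salary', 'title']
print(len(merged))             # 3
```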
df1['project_duration'] = (pd.to_datetime(df1['end_date']) - pd.to_datetime(df1['start_date'])).dt.days

This line calculates the project duration by subtracting the 'start_date' from the 'end_date' of each project and converts it to days. The calculated project duration is added as a new column 'project_duration' in df1 dataframe.
df_expense = df1.groupby('title')['salary'].sum().reset_index(name='expense')

This line groups the df1 dataframe by 'title' (project name) and calculates the sum of 'salary' for each project. The result is stored in the df_expense dataframe with two columns - 'title' and 'expense'.
df_budget_expense = pd.merge(df1, df_expense, how='left', left_on=['title'], right_on=['title'])

This line merges the df1 dataframe with the df_expense dataframe based on the common column 'title'. The resulting dataframe df_budget_expense contains all the columns from df1 and an additional column 'expense' which represents the total salary expense for each project.
df_budget_expense['prorated_expense'] = np.ceil(df_budget_expense['expense'] * (df_budget_expense['project_duration']) / 365)

This line calculates the prorated expense for each project by multiplying 'expense' (the total yearly salary) by 'project_duration' (in days) and dividing by 365 (treating every year as 365 days). The result is rounded up to the next whole dollar using the np.ceil function and stored in a new column 'prorated_expense' in the df_budget_expense dataframe.
df_budget_expense['budget_diff'] = df_budget_expense['prorated_expense'] - df_budget_expense['budget']

This line calculates the budget difference for each project by subtracting the project's budget from the prorated expense. The resulting budget differences are stored in a new column 'budget_diff' in the df_budget_expense dataframe.
df_over_budget = df_budget_expense[df_budget_expense["budget_diff"] > 0]

This line filters the df_budget_expense dataframe to keep only those rows where the 'budget_diff' column is greater than 0, indicating the projects that are overbudget. The resulting dataframe df_over_budget contains the information for the projects that are at risk for going overbudget.
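The same filter can also be written without the intermediate 'budget_diff' column, by comparing the two columns directly. A minimal sketch using the budget and prorated expense of the first three projects from the notebook output (the frame name `costs` is only for illustration):

```python
import pandas as pd

costs = pd.DataFrame({
    "title": ["Project1", "Project2", "Project3"],
    "budget": [29498, 32487, 43909],
    "prorated_expense": [36293, 52870, 7300],
})

# A boolean mask keeps only the rows whose prorated cost exceeds the budget.
over_budget = costs[costs["prorated_expense"] > costs["budget"]]
print(over_budget["title"].tolist())  # ['Project1', 'Project2']
```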
result = df_over_budget[['title', 'budget', 'prorated_expense']]

This line selects the columns 'title', 'budget', and 'prorated_expense' from the df_over_budget dataframe and assigns it to the result dataframe.
result = result.drop_duplicates().sort_values('title')

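This line removes duplicate rows from the result dataframe (up to this point each project appears once per assigned employee) and sorts it by the 'title' column in ascending order. Because 'title' is text, the order is lexicographic, so 'Project11' comes before 'Project2'; the notebook above sorts by a numeric 'project_num' column instead to get the natural project order.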
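The difference between sorting titles as text and sorting them by their trailing number can be seen on a tiny example. The sketch below assumes pandas 1.1 or later, where `sort_values` accepts a `key` function; the frame `titles` is made up for illustration.

```python
import pandas as pd

titles = pd.DataFrame({"title": ["Project11", "Project2", "Project1", "Project2"]})
unique_titles = titles.drop_duplicates()

# Plain text sort: compares character by character.
by_text = unique_titles.sort_values("title")["title"].tolist()

# Numeric sort: the key extracts the trailing digits and compares them as integers.
by_number = unique_titles.sort_values(
    "title",
    key=lambda s: s.str.extract(r"(\d+)$", expand=False).astype(int),
)["title"].tolist()

print(by_text)    # ['Project1', 'Project11', 'Project2']
print(by_number)  # ['Project1', 'Project2', 'Project11']
```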

Bringing It All Together
The code performs the following steps:

1. Merges the project, employee assignment, and employee details dataframes into a single dataframe.
2. Calculates the project duration in days for each project.
3. Calculates the total yearly salary expense for each project.
4. Calculates the prorated expense for each project.
5. Calculates the budget difference for each project.
6. Keeps only the projects that are over budget.
7. Selects the required columns, removes duplicate rows, and sorts the final output.
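The steps above can also be arranged so that salaries are summed per project before joining back to the project table, which keeps one row per project and makes `drop_duplicates` unnecessary. This is a compact sketch under stated assumptions, not the notebook's own code: Project1 and Project2 use the values shown earlier, while "ProjectX" and its employee are made up to show a project that stays within budget.

```python
import numpy as np
import pandas as pd

projects = pd.DataFrame({
    "id": [1, 2, 99],
    "title": ["Project1", "Project2", "ProjectX"],  # ProjectX is hypothetical
    "budget": [29498, 32487, 50000],
    "start_date": ["2018-08-31", "2018-01-27", "2019-01-01"],
    "end_date": ["2019-03-13", "2018-12-13", "2019-02-01"],
})
assignments = pd.DataFrame({
    "emp_id": [10592, 10642, 10593, 10643, 1],
    "project_id": [1, 1, 2, 2, 99],
})
employees = pd.DataFrame({
    "id": [10592, 10642, 10593, 10643, 1],  # employee 1 is hypothetical
    "salary": [20204, 48079, 33154, 27150, 40000],
})

# Sum yearly salaries per project first, so each project stays a single row.
yearly_cost = (
    assignments.merge(employees, left_on="emp_id", right_on="id")
    .groupby("project_id", as_index=False)["salary"].sum()
    .rename(columns={"salary": "expense"})
)

summary = projects.merge(yearly_cost, left_on="id", right_on="project_id")
days = (pd.to_datetime(summary["end_date"]) - pd.to_datetime(summary["start_date"])).dt.days
summary["prorated_expense"] = np.ceil(summary["expense"] * days / 365).astype(int)

# Keep only over-budget projects and the three requested columns.
result = summary.loc[
    summary["prorated_expense"] > summary["budget"],
    ["title", "budget", "prorated_expense"],
]
print(result.to_dict("records"))
```

Grouping by `project_id` rather than by title also avoids merging two different projects that happen to share a name.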

Conclusion
The notebook uses pandas and NumPy to identify projects at risk of going over budget: it joins the three tables, prorates each project's total yearly salary cost to the project's duration in days, and keeps the projects whose prorated cost exceeds their budget.