# Data Analyst: Analyzing Manufacturing Data to Optimize Production Efficiency

You’ve been hired as a Data Analyst on the *manufacturing team* at SpaceX. Your first task is to analyze data from one of the rocket production lines. The goal is to identify bottlenecks and inefficiencies in the process that are affecting the overall production time and suggest ways to improve throughput.

## Step 1: Generate Data

First, we use a Python script to generate synthetic data that mocks a manufacturing dataset. Run the following cell to write the data to the `dataset/generated` directory.

In [5]:
!python3 ../src/data_generator.py

Data generated.


You have access to a dataset that includes the following columns for the past month:
- **Step_ID**: Identifier for the production step (e.g., Assembly, Welding, Inspection, Testing).
- **Duration (hours)**: Time taken for that step to complete.
- **Start_Time**: When the step started.
- **End_Time**: When the step ended.
- **Operator_ID**: The ID of the operator responsible for that step.
- **Machine_ID**: The ID of the machine used in that step (if applicable).
- **Fault_Flag**: A binary flag (1 or 0) indicating if there was a fault during the process that caused a delay.
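
To make the schema concrete, here is a minimal sketch of two rows using the columns listed above. Every value is invented for illustration, and the exact column spellings (for example `Duration (hours)`) and ID formats are assumptions; the generated file may differ.

```python
import pandas as pd

# Hypothetical sample rows: values are made up, not taken from the generated dataset.
sample = pd.DataFrame(
    {
        "Step_ID": ["Assembly", "Welding"],
        "Duration (hours)": [4.0, 2.5],
        "Start_Time": ["2024-01-01 08:00:00", "2024-01-01 12:00:00"],
        "End_Time": ["2024-01-01 12:00:00", "2024-01-01 14:30:00"],
        "Operator_ID": ["OP_01", "OP_02"],
        "Machine_ID": ["M_01", None],  # None where no machine applies to the step
        "Fault_Flag": [0, 1],          # 1 = a fault caused a delay, 0 = no fault
    }
)

print(sample)
```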

## Step 2: Cleaning the Data

Step 1: Import the required modules for data manipulation.

**Pandas**: Pandas is a Python library for working with data sets. It provides functions for analyzing, cleaning, exploring, and manipulating data. The name "Pandas" refers to both "Panel Data" and "Python Data Analysis". The library was created by Wes McKinney in 2008.

**Numpy**: NumPy is the fundamental package for scientific computing in Python. It is a Python library that provides a multidimensional array object, various derived objects (such as masked arrays and matrices), and an assortment of routines for fast operations on arrays, including mathematical, logical, shape manipulation, sorting, selecting, I/O, discrete Fourier transforms, basic linear algebra, basic statistical operations, random simulation and much more.

In [9]:
import pandas as pd
import numpy as np

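If this cell fails with `ModuleNotFoundError: No module named 'pandas'`, the packages are not installed in the active environment. Install them (for example, with `pip install pandas numpy`) and re-run the cell.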

Step 2: Load the dataset into Python via Pandas

In [None]:
df = pd.read_csv("../dataset/generated/synthetic_manufacturing_data.csv")
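
Before cleaning, it can help to confirm that the loaded file has the columns described earlier. The sketch below is one way to do that. The helper `missing_columns` and the list `EXPECTED_COLUMNS` are hypothetical names introduced here, the column spellings are assumed from the dataset description, and an inline CSV string stands in for the real file so the example runs on its own.

```python
import io

import pandas as pd

# Assumed column names, taken from the dataset description; adjust if the file differs.
EXPECTED_COLUMNS = [
    "Step_ID",
    "Duration (hours)",
    "Start_Time",
    "End_Time",
    "Operator_ID",
    "Machine_ID",
    "Fault_Flag",
]


def missing_columns(frame, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from the frame."""
    return [col for col in expected if col not in frame.columns]


# Stand-in for the generated CSV (one invented row).
csv_text = (
    "Step_ID,Duration (hours),Start_Time,End_Time,Operator_ID,Machine_ID,Fault_Flag\n"
    "Assembly,4.0,2024-01-01 08:00:00,2024-01-01 12:00:00,OP_01,M_01,0\n"
)
demo = pd.read_csv(io.StringIO(csv_text))

print(missing_columns(demo))                                # nothing missing
print(missing_columns(demo.drop(columns=["Fault_Flag"])))   # one column missing
```

With the real data, the same check would be `missing_columns(df)`.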


Step 3: Check for missing values. Handle them by either dropping the affected rows or filling the gaps with a placeholder value such as 0.

In [None]:
print(df.isnull().sum())  # Show the number of missing values in each column

# dropna() and fillna() return new DataFrames, so assign the result back.
df = df.dropna()  # Drop rows with missing values
# OR
# df = df.fillna(value=0)  # Fill missing values with 0
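
Dropping and filling do not have to be applied to the whole table at once. As a minimal sketch, the example below drops rows that lack a start time but fills gaps in other columns. The demo data is invented, and the choices made here (treating a missing `Machine_ID` as "no machine" and a missing `Fault_Flag` as 0) are assumptions that may not suit the real dataset.

```python
import numpy as np
import pandas as pd

# Invented rows with gaps in different columns.
demo = pd.DataFrame(
    {
        "Step_ID": ["Assembly", "Welding", "Inspection"],
        "Start_Time": ["2024-01-01 08:00", None, "2024-01-01 15:00"],
        "Machine_ID": ["M_01", "M_02", None],
        "Fault_Flag": [0, 1, np.nan],
    }
)

print(demo.isnull().sum())  # missing values per column

# A row without a start time cannot be placed on the timeline, so drop it.
cleaned = demo.dropna(subset=["Start_Time"])

# Assumed fill values: "NONE" marks steps without a machine, 0 means no fault recorded.
cleaned = cleaned.fillna({"Machine_ID": "NONE", "Fault_Flag": 0})

print(cleaned)
```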

Step 4: Convert the **Start_Time** and **End_Time** columns to datetimes so that they are properly formatted for time calculations.

In [None]:
df['Start_Time'] = pd.to_datetime(df['Start_Time'])
df['End_Time'] = pd.to_datetime(df['End_Time'])
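
By default, `pd.to_datetime` raises an error on values it cannot parse. A hedged alternative, sketched below on invented data, is to pass `errors="coerce"` so that bad values become `NaT`, then flag rows that failed to parse or whose end time precedes their start time. The variable names `unparsed` and `reversed_rows` are introduced here for illustration.

```python
import pandas as pd

# Invented rows: one unparseable start time and one end time before its start.
demo = pd.DataFrame(
    {
        "Start_Time": ["2024-01-01 08:00:00", "not a date", "2024-01-02 09:00:00"],
        "End_Time": ["2024-01-01 12:00:00", "2024-01-01 14:00:00", "2024-01-02 07:00:00"],
    }
)

for col in ["Start_Time", "End_Time"]:
    # errors="coerce" turns unparseable values into NaT instead of raising.
    demo[col] = pd.to_datetime(demo[col], errors="coerce")

# Rows where at least one timestamp could not be parsed.
unparsed = demo[["Start_Time", "End_Time"]].isna().any(axis=1)

# Rows where the step appears to end before it starts (comparisons with NaT are False).
reversed_rows = demo["End_Time"] < demo["Start_Time"]

print(demo[unparsed | reversed_rows])
```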

Step 5: Ensure the **Duration (hours)** column is consistent with **Start_Time** and **End_Time**. You can calculate the duration from the two timestamps and compare it with the recorded value:

In [None]:
df["Calculated_Duration"] = (df["End_Time"] - df["Start_Time"]).dt.total_seconds() / 3600
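
The comparison itself can be sketched as follows. Because both durations are floating-point numbers, the example uses `np.isclose` with an absolute tolerance rather than exact equality. The demo rows are invented, the recorded column name `Duration (hours)` is assumed from the dataset description, and `TOLERANCE_HOURS` and `Duration_Mismatch` are hypothetical names chosen for this example.

```python
import numpy as np
import pandas as pd

# Invented rows: the third recorded duration disagrees with its timestamps.
demo = pd.DataFrame(
    {
        "Duration (hours)": [4.0, 2.5, 1.0],
        "Start_Time": pd.to_datetime(
            ["2024-01-01 08:00", "2024-01-01 12:00", "2024-01-01 15:00"]
        ),
        "End_Time": pd.to_datetime(
            ["2024-01-01 12:00", "2024-01-01 14:30", "2024-01-01 17:00"]
        ),
    }
)

demo["Calculated_Duration"] = (
    demo["End_Time"] - demo["Start_Time"]
).dt.total_seconds() / 3600

# Assumed tolerance: differences under 0.01 hours (36 seconds) count as consistent.
TOLERANCE_HOURS = 0.01

demo["Duration_Mismatch"] = ~np.isclose(
    demo["Duration (hours)"],
    demo["Calculated_Duration"],
    rtol=0,
    atol=TOLERANCE_HOURS,
)

print(demo[demo["Duration_Mismatch"]])  # rows whose recorded duration is inconsistent
```

Rows flagged as mismatches are worth inspecting before deciding which of the two values to trust.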