# **Project Name**    -



##### **Project Type**    - TATA CLASSIFICATION EDA PROJECT
##### **Contribution**    - Individual

# **Project Summary -**

### **Summary of EDA**  

1. **Dataset Overview**  
   - The dataset has **90,954 rows and 13 columns**, with no missing values.  
   - It includes **numerical features** (temperature, speed, torque, tool wear), a **categorical feature** (Product Type: L, M, H), and **failure indicators** (TWF, HDF, PWF, OSF, RNF).  

2. **Key Findings**  
   - **Temperature Variables**: Air and process temperatures are highly correlated.  
   - **Rotational Speed & Torque**: Weak negative correlation, meaning higher speed tends to reduce torque.  
   - **Tool Wear**: Does not strongly correlate with other variables but may affect failures.  
   - **Failure Indicators**: Failures are rare, as shown by very low mean values.  

3. **Visual Insights**  
   - **Histograms** show most numerical features are approximately normally distributed.  
   - **Correlation Heatmap** confirms strong relationships between temperature variables and weak failure correlations.  
   - **Product Type Distribution** is fairly balanced.

# **GitHub Link -**

GitHub link: https://github.com/Runal21/TATA-CLASSIFICATION-EDA-PROJECT

# **Problem Statement**


**Write Problem Statement Here.**

#### **Define Your Business Objective?**

The objective is to predict machine failures in advance to reduce downtime, optimize performance, and lower maintenance costs.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure each chart follows the format below and answers its questions.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the visualization in a structured way while following the "UBM" rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries

# Data Handling and Analysis
import pandas as pd
import numpy as np

# Data Visualization
import matplotlib.pyplot as plt
import seaborn as sns

### Dataset Loading

In [None]:
# Load the dataset
df = pd.read_csv('test (1).csv')

# Show the first 5 rows to understand the data structure
df.head()
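
The guidelines above ask for exception handling, so the loading step could be wrapped in a small helper. This is a minimal sketch, not part of the original notebook: `load_dataset` is a hypothetical name, and the checks (file exists, at least one row) are assumptions about what is worth validating here.

```python
from pathlib import Path

import pandas as pd


def load_dataset(path):
    """Read a CSV file into a DataFrame, failing early with a clear message."""
    csv_path = Path(path)
    if not csv_path.is_file():
        # Report the absolute path so a wrong working directory is easy to spot
        raise FileNotFoundError(f"Dataset not found: {csv_path.resolve()}")
    frame = pd.read_csv(csv_path)
    if frame.empty:
        raise ValueError(f"Dataset at {csv_path} contains no rows")
    return frame


# Usage in this notebook (file name taken from the loading cell):
# df = load_dataset('test (1).csv')
```

With this helper, a missing file stops the notebook at the first cell instead of producing a `NameError` on `df` later.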

### Dataset First View

The dataset consists of **machine operating parameters and failure indicators**.

**Key Columns:**

1. **Product ID & Type** → Identifies the product and its category (L, M, H).
2. **Air & Process Temperature (K)** → Measures machine temperature.
3. **Rotational Speed (RPM) & Torque (Nm)** → Key operational parameters.
4. **Tool Wear (min)** → Measures wear and tear over time.
5. **Failure Indicators (TWF, HDF, PWF, OSF, RNF)** → Binary values (0 or 1),
   indicating whether a failure occurred.

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count

# Display the number of rows and columns in the dataset
rows, columns = df.shape
print(f"The dataset contains {rows} rows and {columns} columns.")

### Dataset Information

In [None]:
# Display dataset information including column names, data types, and non-null values
df.info()


#### Duplicate Values

In [None]:
# Count the number of duplicate rows in the dataset
duplicate_count = df.duplicated().sum()
print(f"Number of duplicate rows in the dataset: {duplicate_count}")


#### Missing Values/Null Values

In [None]:
# Count the number of missing/null values in each column
missing_values = df.isnull().sum()

# Display only columns with missing values (if any)
missing_values[missing_values > 0]

In [None]:
# Visualizing missing values using a heatmap
plt.figure(figsize=(10, 5))
sns.heatmap(df.isnull(), cbar=False, cmap='viridis')
plt.title("Missing Values Heatmap")
plt.show()


### **Missing Values Visualization**  
- The heatmap confirms that **there are no missing values** in the dataset. ✅  
- Since all columns are fully populated, we can proceed with **EDA and visualizations** without concerns about data imputation. 🚀
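
The missing-value and duplicate checks can also be collected into one table, which is easier to scan than a heatmap when nothing is missing. A minimal sketch follows; `quality_report` is a hypothetical helper, and the three-row frame is made up purely to show the output shape (it is not project data).

```python
import pandas as pd


def quality_report(frame):
    """Per-column dtype, missing count, missing percentage and unique count."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "missing": frame.isnull().sum(),
        "missing_pct": (frame.isnull().mean() * 100).round(2),
        "unique": frame.nunique(),
    })


# Made-up rows, only to demonstrate the report layout
demo = pd.DataFrame({"Torque [Nm]": [40.1, None, 38.5], "Type": ["L", "M", "L"]})
print(quality_report(demo))
print("Duplicate rows:", int(demo.duplicated().sum()))

# In the notebook: quality_report(df)
```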

### What did you know about your dataset?

### **Key Insights About the Dataset**  

1. **Dataset Overview**  
   - **90,954 rows** and **13 columns**  
   - Contains **machine operational parameters** and **failure indicators**  

2. **Data Quality**  
   - **No missing values** ✅  
   - **No duplicate rows** ✅  

3. **Feature Types**  
   - **Numerical Features**: Temperature, Speed, Torque, Tool Wear  
   - **Categorical Features**: Product Type (L, M, H)  
   - **Binary Features**: Failure Indicators (TWF, HDF, PWF, OSF, RNF)  

4. **Initial Observations**  
   - **Failures are rare events**, meaning failure indicators are mostly 0.  
   - **Temperatures are correlated**, indicating a relationship between air and process temperature.  
   - **Tool wear is recorded in minutes of use**, so it accumulates over time and may impact failures.  


## ***2. Understanding Your Variables***

In [None]:
# Display column names in the dataset
columns_list = df.columns.tolist()

columns_list


In [None]:
# Display summary statistics of numerical columns
summary_statistics = df.describe()
summary_statistics

### Variables Description

### **Dataset Columns**  
The dataset contains the following columns:  
- **id**: Unique identifier for each record  
- **Product ID**: Identifier for the product  
- **Type**: Categorical variable (L, M, H)  
- **Air temperature [K]**, **Process temperature [K]**: Machine temperature readings  
- **Rotational speed [rpm]**, **Torque [Nm]**: Machine operating parameters  
- **Tool wear [min]**: Measures tool usage over time  
- **TWF, HDF, PWF, OSF, RNF**: Binary failure indicators  

### **Dataset Summary Statistics**  
- **Temperatures**: Range from **295.3K to 304.4K** (air) and from **305.7K to 313.8K** (process)  
- **Rotational Speed**: **1168 to 2886 RPM**, with an average of **1520 RPM**  
- **Torque**: **3.8 Nm to 76.6 Nm**, mean around **40.3 Nm**  
- **Tool Wear**: **0 to 253 minutes**, average **104 minutes**  
- **Failure Rates**: Very low failure occurrences (mostly **0s**)  

Next, we will conduct **Exploratory Data Analysis (EDA) with visualizations**.
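
To put a number on "very low failure occurrences", the share of rows flagged `1` can be computed per indicator. This is a sketch under the assumption that the five failure columns are named as in the column list above; `failure_rates` is a hypothetical helper and the demo rows are made up.

```python
import pandas as pd

FAILURE_COLS = ["TWF", "HDF", "PWF", "OSF", "RNF"]


def failure_rates(frame, cols=FAILURE_COLS):
    """Count and percentage of rows flagged 1 for each failure indicator."""
    flags = frame[cols]
    summary = pd.DataFrame({
        "failures": flags.sum(),
        "rate_pct": flags.mean() * 100,
    })
    # Share of rows where at least one failure flag is set
    any_failure_pct = flags.any(axis=1).mean() * 100
    return summary, any_failure_pct


# Made-up rows, only to show the output
demo = pd.DataFrame({
    "TWF": [0, 1, 0, 0],
    "HDF": [0, 1, 0, 1],
    "PWF": [0, 0, 0, 0],
    "OSF": [0, 0, 0, 0],
    "RNF": [0, 0, 0, 0],
})
summary, any_failure_pct = failure_rates(demo)
print(summary)
print(f"Rows with at least one failure: {any_failure_pct:.1f}%")

# In the notebook: failure_rates(df)
```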

### Check Unique Values for each variable.

In [None]:
# Count the number of unique values in each column
unique_values = df.nunique()

# Display unique value counts
unique_values


## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# 1. Drop identifier columns - 'id' and 'Product ID' only label records and may not be useful for analysis
df_cleaned = df.drop(columns=['id', 'Product ID'])

# 2. Check data types (print is needed: a cell only displays its last expression)
print(df_cleaned.dtypes)

# 3. Convert 'Type' to pandas' categorical dtype (values remain the labels L, M, H)
df_cleaned['Type'] = df_cleaned['Type'].astype('category')

# 4. Confirm changes
df_cleaned.info()

### What all manipulations have you done and insights you found?

### **Data Manipulations Done:**  
1. **Dropped Unnecessary Columns** → Removed `id` and `Product ID`, as they are unique identifiers and not useful for analysis.  
2. **Checked & Ensured Correct Data Types** →  
   - `Type` was converted to a **categorical variable**.  
   - Other data types (float, int) were correct.  
3. **Confirmed No Missing Values** → The dataset is **clean and complete** for analysis.  
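
If a later modeling step needs numbers rather than labels, the categorical `Type` column can be encoded in two common ways. The sketch below is illustrative only: it assumes that L < M < H is a meaningful order (low, medium, high), which this notebook does not verify, and the four demo rows are made up.

```python
import pandas as pd

# Assumption: L < M < H is a meaningful ordering of product types
type_order = pd.CategoricalDtype(categories=["L", "M", "H"], ordered=True)

demo = pd.DataFrame({"Type": ["M", "L", "H", "L"]})  # made-up rows
demo["Type"] = demo["Type"].astype(type_order)

# Integer codes follow the declared order: L -> 0, M -> 1, H -> 2
demo["Type_code"] = demo["Type"].cat.codes

# One-hot columns: an alternative when no ordering should be implied
one_hot = pd.get_dummies(demo["Type"], prefix="Type")

print(demo)
print(one_hot)
```

Integer codes keep a single column but impose an order; one-hot columns avoid that at the cost of extra columns.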

### **Insights Found:**  
- **Machine Failures are Rare:** Failure indicators (TWF, HDF, PWF, OSF, RNF) have very few occurrences of `1`, suggesting failures are uncommon.  
- **Temperature Variables are Related:** Air and Process Temperatures are **highly correlated**, indicating a dependency.  
- **Wide Range of Operational Values:**  
  - **Rotational speed varies** from **1168 to 2886 RPM**.  
  - **Torque ranges** from **3.8 to 76.6 Nm**.  
  - **Tool wear can go up to 253 minutes**.  
- **Product Type (`L, M, H`) is an Important Categorical Variable** → Needs further analysis on failure distribution across types.  

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### UNIVARIATE ANALYSIS - COUNT PLOT: Product Type Distribution

In [None]:
# Plot the distribution of product types
plt.figure(figsize=(6, 4))
sns.countplot(x="Type", data=df_cleaned, palette="pastel")
plt.title("Distribution of Product Types", fontsize=14)
plt.xlabel("Product Type")
plt.ylabel("Count")
plt.show()

##### 1. Why did you pick the specific chart?

A count plot is used to visualize the frequency of different product types (L, M, H). This helps us understand the distribution of machine types in the dataset.

##### 2. What is/are the insight(s) found from the chart?

**Insights from the Chart:**

1. The dataset consists of **three product types**: **L, M, and H**.  
2. The distribution appears **balanced**, meaning no single type is overwhelmingly dominant.  
3. This balance ensures **fair analysis** across different product categories without bias.  
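
Whether the distribution is balanced is easier to judge from numbers than from bar heights. A minimal sketch: `category_shares` is a hypothetical helper, and the six-element series is made up to show the output (it does not reflect the real class proportions).

```python
import pandas as pd


def category_shares(series):
    """Counts and percentage share for each category, largest first."""
    counts = series.value_counts()
    shares = (counts / counts.sum() * 100).round(2)
    return pd.DataFrame({"count": counts, "share_pct": shares})


# Made-up labels, only to show the output
demo_types = pd.Series(["L", "L", "M", "H", "L", "M"], name="Type")
table = category_shares(demo_types)

# Ratio of the largest to the smallest class; 1.0 means perfectly balanced
imbalance_ratio = table["count"].max() / table["count"].min()

print(table)
print(f"Imbalance ratio: {imbalance_ratio:.1f}")

# In the notebook: category_shares(df_cleaned["Type"])
```

If the ratio on the real data turns out to be far above 1, the "balanced" reading of the count plot should be revisited.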

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Business Impact of the Insights**  

✅ **Positive Business Impact:**  
- A **balanced distribution** of product types (L, M, H) ensures that **failure analysis is unbiased** across different machine categories.  
- If failure rates differ significantly across types, businesses can **optimize manufacturing processes** for specific product categories.  
- Helps in **resource allocation**, ensuring maintenance teams focus on the most failure-prone machine types.  

❌ **Negative Growth Risks:**  
- If one product type has **significantly higher failures**, it could indicate **design flaws** or **poor maintenance practices**, leading to **increased operational costs**.  
- An **imbalance in product distribution** (one type dominating) could introduce **bias in predictive models**, affecting accuracy.

#### Chart - 2

In [None]:
# Chart 2: Air Temperature Distribution (test dataset)

plt.figure(figsize=(8, 5))
sns.histplot(df["Air temperature [K]"], bins=50, kde=True, color="skyblue")
plt.title("Distribution of Air Temperature [K]", fontsize=14)
plt.xlabel("Air Temperature [K]")
plt.ylabel("Frequency")
plt.show()

##### 1. Why did you pick the specific chart?

A **histogram** helps us understand how air temperature varies across machines. This is important because **temperature fluctuations can impact machine performance and failures**.  


##### 2. What is/are the insight(s) found from the chart?

**Insights from the Chart:**  
1. **Air temperature appears approximately normally distributed**, with most values ranging between **295K and 305K**.  
2. There are **no extreme outliers**, indicating stable temperature readings.  
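
A histogram only suggests normality by eye, so a quick numeric check can back it up. The sketch below uses skewness and excess kurtosis from pandas, both of which are close to 0 for a normal distribution. `shape_summary` is a hypothetical helper, and the two samples are synthetic, generated only to show how the numbers behave.

```python
import numpy as np
import pandas as pd


def shape_summary(series):
    """Skewness and excess kurtosis; both are near 0 for a normal distribution."""
    return pd.Series({
        "skew": series.skew(),
        "excess_kurtosis": series.kurt(),
    })


# Synthetic samples (not project data)
rng = np.random.default_rng(0)
bell_shaped = pd.Series(rng.normal(loc=300.0, scale=2.0, size=5000))
right_skewed = pd.Series(rng.exponential(scale=2.0, size=5000))

print(shape_summary(bell_shaped))
print(shape_summary(right_skewed))

# In the notebook: shape_summary(df["Air temperature [K]"])
```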

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Business Impact:**  
✅ **Positive Impact:**  
- Consistent temperature range suggests **machines operate under stable conditions**, reducing risks.  
- Helps identify **temperature anomalies** that could indicate potential failures.  

❌ **Potential Risks:**  
- If failures are linked to specific temperature ranges, it could signal **overheating issues**, requiring **preventive maintenance**.  

In other words, machines work best within a **specific temperature range**. If the air temperature is **too high**, it may cause **overheating**, which can **damage parts and lead to failures**.  

*For example:*  
- If we find that most machine failures happen when the **air temperature is above 303K**, this means **high temperature is a risk factor**.  
- In that case, companies can use **cooling systems** or **adjust operations** to prevent overheating and reduce failures.  


#### Chart - 3

In [None]:
# Chart 3: Box Plot of Process Temperature (test dataset)

plt.figure(figsize=(6, 5))
sns.boxplot(y=df["Process temperature [K]"], color="orange")
plt.title("Box Plot of Process Temperature [K]", fontsize=14)
plt.ylabel("Process Temperature [K]")
plt.show()


##### 1. Why did you pick the specific chart?

A **box plot** helps detect:  
- **Median process temperature**  
- **Temperature spread (range)**  
- **Potential outliers (extreme values that may indicate overheating issues)**  

##### 2. What is/are the insight(s) found from the chart?

**Insights from the Box Plot:**  
1. **Median Process Temperature**: The middle line in the box shows the typical process temperature value.  
2. **Temperature Spread**: The height of the box represents how much process temperature varies across machines.  
3. **Outliers**: If there are points **far outside the box**, it means some machines have unusually high or low process temperatures, which could indicate **potential risks**.  
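
The points a box plot draws beyond its whiskers follow the usual 1.5 × IQR rule, so they can be counted directly. A minimal sketch: `iqr_outliers` is a hypothetical helper, and the seven temperature values are made up for illustration.

```python
import pandas as pd


def iqr_outliers(series, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] and the bounds used."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    mask = (series < lower) | (series > upper)
    return series[mask], (lower, upper)


# Made-up readings, only to show the output
demo = pd.Series([309.0, 310.0, 310.5, 311.0, 311.5, 312.0, 325.0])
outliers, (lower, upper) = iqr_outliers(demo)

print(f"Bounds: {lower:.2f} to {upper:.2f}")
print("Outliers:", outliers.tolist())

# In the notebook: iqr_outliers(df["Process temperature [K]"])
```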


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Business Impact:**  
✅ **Positive Impact:**  
- Understanding the **normal operating range** helps maintain stable machine performance.  
- Identifying **outliers** allows early detection of **machines at risk of failure** due to extreme temperatures.  

❌ **Potential Risks:**  
- If failures mostly occur at **higher process temperatures**, it may indicate **overheating issues**, requiring preventive maintenance.  
- **Too much variation** in temperature can suggest **inconsistent operating conditions**, which may affect machine lifespan.  


#### Chart - 4

In [None]:
# Chart 4: Scatter Plot - Process Temperature vs. Air Temperature

plt.figure(figsize=(8, 5))
sns.scatterplot(x=df["Air temperature [K]"], y=df["Process temperature [K]"], alpha=0.5, color="purple")
plt.title("Scatter Plot: Air Temperature vs. Process Temperature", fontsize=14)
plt.xlabel("Air Temperature [K]")
plt.ylabel("Process Temperature [K]")
plt.show()

##### 1. Why did you pick the specific chart?

A **scatter plot** helps us understand the relationship between **air temperature and process temperature**.  
- If they have a **strong correlation**, the two temperatures **move together**, which is consistent with air temperature influencing process temperature (correlation alone does not prove causation).  
- If there is **no pattern**, other factors might be affecting process temperature.

##### 2. What is/are the insight(s) found from the chart?

**Expected Insights from the Chart:**  
1. If the points form a **diagonal trend**, it means **process temperature increases with air temperature** (strong correlation).  
2. If the points are **scattered randomly**, air temperature **does not strongly influence process temperature**.  
3. Outliers (points far from the trend) may indicate **anomalous conditions** that require attention.
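
The strength of the trend in the scatter plot can be quantified with a correlation coefficient. The sketch below reports Pearson (linear) and Spearman (rank-based, computed here as the Pearson correlation of the ranks). `pair_correlation` is a hypothetical helper and the five value pairs are made up.

```python
import pandas as pd


def pair_correlation(frame, x, y):
    """Pearson and Spearman correlation between two numeric columns."""
    pearson = frame[x].corr(frame[y])
    # Spearman is the Pearson correlation of the ranked values
    spearman = frame[x].rank().corr(frame[y].rank())
    return {"pearson": pearson, "spearman": spearman}


# Made-up readings, only to show the output
demo = pd.DataFrame({
    "Air temperature [K]": [298.0, 299.0, 300.0, 301.0, 302.0],
    "Process temperature [K]": [308.5, 309.0, 310.2, 310.9, 312.0],
})
result = pair_correlation(demo, "Air temperature [K]", "Process temperature [K]")
print(result)

# In the notebook:
# pair_correlation(df, "Air temperature [K]", "Process temperature [K]")
```

A Spearman value well above the Pearson value would hint at a monotonic but non-linear relationship.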

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

✅ **Positive Business Impact:**  
- **Predictability & Control**: If process temperature **strongly correlates** with air temperature, businesses can **predict overheating risks** and take **preventive actions** (e.g., cooling systems, ventilation improvements).  
- **Early Anomaly Detection**: Machines with **unexpected temperature behavior** (outliers) can be **flagged for maintenance**, reducing the risk of sudden failures.  
- **Improved Energy Efficiency**: Understanding how air temperature impacts process temperature can help **optimize energy use**, reducing costs.  

❌ **Potential Negative Growth Risks:**  
- If the scatter plot shows **high variability (weak correlation)**, it may indicate **uncontrolled process conditions**, making it **hard to predict failures**.  
- **Unstable process temperature** across different air temperatures may suggest **inefficient heat management**, leading to **higher maintenance costs** and **reduced machine lifespan**.  

#### Chart - 5

In [None]:
# Chart 5: Box Plot - Rotational Speed by Product Type

plt.figure(figsize=(8, 5))
sns.boxplot(x=df["Type"], y=df["Rotational speed [rpm]"], palette="coolwarm")
plt.title("Box Plot of Rotational Speed by Product Type", fontsize=14)
plt.xlabel("Product Type")
plt.ylabel("Rotational Speed [RPM]")
plt.show()

##### 1. Why did you pick the specific chart?

A **box plot** helps compare the **rotational speed** of different **product types (L, M, H)**.  
- This will show **which product type operates at higher or lower speeds**.  
- Identifies **outliers** (machines with unusually high or low speeds).  


##### 2. What is/are the insight(s) found from the chart?

**Expected Insights from the Chart:**

1. If one product type has **higher median speed**, it may indicate **differences in machine design**.  
2. **Wide boxes** or **many outliers** suggest **high variation in speed**, which could affect **performance and failure rates**.  
3. If failures **correlate with higher speeds**, it may suggest **mechanical stress** as a failure factor.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

✅ **Positive Business Impact:**  
- **Understanding Product Differences**: If one product type operates at consistently **higher speeds**, manufacturers can **tailor maintenance schedules** to that product’s needs.  
- **Preventive Maintenance Optimization**: If machines with **higher rotational speed** have **more failures**, maintenance teams can focus on them to **prevent unexpected breakdowns**.  
- **Performance Benchmarking**: Helps in setting **optimal speed ranges** for each product type to **maximize efficiency and lifespan**.  

❌ **Potential Negative Growth Risks:**  
- If **one product type has significantly more variation in speed**, it may indicate **inconsistent manufacturing quality**, leading to **higher wear and tear**.  
- If failures **increase at higher speeds**, machines might be **operating beyond safe limits**, requiring **design improvements**.  

**Justification & Next Steps:**  
- If higher speeds **correlate with failures**, businesses can **adjust machine settings** to reduce failure rates.  
- If one product type has **high speed variability**, quality control improvements may be needed.

#### Chart - 6

In [None]:
# Chart 6: Correlation Heatmap (Fixing String Conversion Issue)

# Selecting only numeric columns
numeric_df = df.select_dtypes(include=['number'])

# Plot the correlation heatmap
plt.figure(figsize=(10, 6))
sns.heatmap(numeric_df.corr(), annot=True, cmap="coolwarm", fmt=".2f", linewidths=0.5)
plt.title("Correlation Heatmap of Machine Parameters", fontsize=14)
plt.show()


##### 1. Why did you pick the specific chart?

✅ **1. Identifying Relationships Between Features**  
- A **correlation heatmap** helps us **quickly spot relationships** between different machine parameters.  
- If certain features are **highly correlated**, they **tend to move together**, which can help in failure prediction.  

✅ **2. Detecting Key Failure Causes**  
- If failure indicators (TWF, HDF, etc.) show **high correlation** with specific machine parameters (e.g., high torque, high speed), it helps in **understanding what causes failures**.  
- Businesses can then **optimize machine settings** to reduce breakdowns.  

✅ **3. Improving Predictive Models**  
- If we find that some features **show no correlation** with failures, they carry **little linear signal** and are candidates for removal, after checking for non-linear effects that a correlation coefficient would miss.  


##### 2. What is/are the insight(s) found from the chart?

**Insights from the Correlation Heatmap:**  

1. **Strong Positive Correlation (Close to +1, Dark Red Areas)**  
   - **Air Temperature [K]** and **Process Temperature [K]** → These two are **highly correlated**, meaning that **when air temperature increases, process temperature also tends to increase**.  
   - This suggests that **ambient temperature is closely tied to machine operating temperature**.  

2. **Negative Correlation (Below 0, Blue Areas)**  
   - **Torque [Nm]** and **Rotational Speed [rpm]** → **Higher speeds tend to go with lower torque**, which is expected in mechanical systems.  
   - If failures are more common at **higher speeds**, reducing speed may **prevent breakdowns**.  

3. **Failure Indicators (TWF, HDF, etc.) and Machine Parameters**  
   - If failure types (TWF, HDF, etc.) show a **strong correlation** with specific machine parameters like **high tool wear or extreme temperatures**, these factors are **major contributors to machine failures**.  
   - Preventive maintenance can **focus on these critical parameters** to **reduce breakdowns**.  
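
Reading individual cells off a heatmap is slow, so the column for one failure indicator can be pulled out and sorted by absolute correlation. A minimal sketch: `correlations_with_target` is a hypothetical helper, and the six demo rows are made up (they say nothing about the real correlations).

```python
import pandas as pd


def correlations_with_target(frame, target):
    """Correlation of each numeric column with `target`, strongest first."""
    numeric = frame.select_dtypes(include="number")
    corr = numeric.corr()[target].drop(target)
    # Order by magnitude so strong negative correlations are not buried
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


# Made-up rows, only to show the output
demo = pd.DataFrame({
    "Type": ["L", "M", "L", "H", "L", "M"],
    "Tool wear [min]": [10, 50, 120, 200, 230, 240],
    "Torque [Nm]": [40.0, 41.0, 39.0, 40.0, 42.0, 38.0],
    "TWF": [0, 0, 0, 0, 1, 1],
})
ranked = correlations_with_target(demo, "TWF")
print(ranked)

# In the notebook: correlations_with_target(df, "TWF")
```

Because failures are rare, even the strongest coefficients in this table may be small; that alone does not make a feature uninformative.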

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Business Impact:**  
✅ **Positive Impact:**  
- Identifying **high-risk factors** helps in **preventing machine failures**.  
- Companies can **optimize machine settings** to reduce failure rates and improve efficiency.  

❌ **Potential Risks:**  
- If failures are linked to **high-speed operation**, companies may need to **lower speed limits**, potentially reducing production efficiency.  
- Machines with **high temperature dependency** may require **extra cooling systems**, increasing **operational costs**.  

The next chart explores how and why torque and rotational speed are negatively correlated.

#### Chart - 7

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Chart: Scatter Plot - Torque vs. RPM
plt.figure(figsize=(8, 5))
sns.scatterplot(x=df["Rotational speed [rpm]"], y=df["Torque [Nm]"], alpha=0.5, color="red")
plt.title("Scatter Plot: Torque vs. Rotational Speed", fontsize=14)
plt.xlabel("Rotational Speed [RPM]")
plt.ylabel("Torque [Nm]")
plt.show()

##### 1. Why did you pick the specific chart?

A scatter plot shows whether Torque (Nm) and Rotational Speed (RPM) are inversely proportional.
If the points form a downward trend, it confirms a negative correlation (inverse relationship).

##### 2. What is/are the insight(s) found from the chart?

**Expected Insights from the Chart:**  
1. **If the points form a downward trend** → **Inverse relationship confirmed** (Higher RPM = Lower Torque).  
2. **If the points are scattered randomly** → No strong relationship between torque and RPM.


**Is Torque Inversely Proportional to RPM?**  

Yes, **torque (Nm) and rotational speed (RPM) are inversely related at a given power level** in most mechanical systems, following the **power equation**:  

***Power = Torque × Angular Speed***, where angular speed (rad/s) = 2π × RPM / 60

When **power is held roughly constant**, an **increase in rotational speed (RPM) means a decrease in torque**, and vice versa.  

**Practical Example:**  
- In **low-speed operations**, machines generate **high torque** (e.g., heavy lifting).  
- In **high-speed operations**, torque is **lower** to maintain efficiency (e.g., fast-spinning tools).  
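
The power relation can be worked through numerically. Mechanical power in watts is torque (Nm) times angular speed (rad/s), so RPM must first be converted with 2π/60. The helper names and the operating point (40 Nm at 1500 RPM) below are illustrative assumptions, not values taken from the analysis.

```python
import numpy as np


def mechanical_power_watts(torque_nm, speed_rpm):
    """Power (W) = torque (Nm) x angular speed (rad/s)."""
    angular_speed = np.asarray(speed_rpm, dtype=float) * 2 * np.pi / 60
    return np.asarray(torque_nm, dtype=float) * angular_speed


def torque_at_constant_power(power_watts, speed_rpm):
    """Torque (Nm) available at a given speed if power is held fixed."""
    angular_speed = np.asarray(speed_rpm, dtype=float) * 2 * np.pi / 60
    return power_watts / angular_speed


power = mechanical_power_watts(40.0, 1500.0)           # illustrative operating point
torque_fast = torque_at_constant_power(power, 3000.0)  # same power, double the speed

print(f"Power at 40 Nm and 1500 RPM: {power:.1f} W")
print(f"Torque at 3000 RPM for the same power: {torque_fast:.1f} Nm")

# In the notebook (the new column name is hypothetical):
# df["Power [W]"] = mechanical_power_watts(df["Torque [Nm]"], df["Rotational speed [rpm]"])
```

Doubling the speed at fixed power halves the torque, which is the inverse relationship the scatter plot is meant to reveal.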


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Business Impact of Torque vs. RPM Analysis**  

✅ **Positive Business Impact:**  
- **Performance Optimization**: If torque and RPM have a strong inverse relationship, businesses can **optimize machine settings** to balance speed and power.  
- **Energy Efficiency**: Machines can be **calibrated to operate at ideal RPM-torque levels**, reducing unnecessary energy consumption.  
- **Preventive Maintenance**: If high-speed operations lead to **low torque and failures**, maintenance teams can **monitor speed variations** to prevent damage.  

❌ **Potential Risks (Negative Growth Factors):**  
- **High-Speed Failures**: If high RPM significantly reduces torque, machines might **struggle under heavy loads**, leading to **higher breakdown rates**.  
- **Wear and Tear**: Continuous **high-RPM, low-torque operation** may cause **excessive wear on machine parts**, increasing maintenance costs.  
- **Production Limitations**: If torque must be **kept high to prevent failures**, businesses may need to **slow down operations**, reducing output.  

**Justification & Next Steps:**  
- If failures increase at **higher speeds**, companies may **adjust speed limits** or **reinforce machine components**.  
- If torque drops **too low** at high speeds, alternative **gear mechanisms or cooling solutions** may be needed.  

#### Chart - 8

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Chart 8: Violin Plot - Tool Wear by Product Type
plt.figure(figsize=(8, 5))
sns.violinplot(x=df["Type"], y=df["Tool wear [min]"], palette="muted")
plt.title("Violin Plot of Tool Wear by Product Type", fontsize=14)
plt.xlabel("Product Type")
plt.ylabel("Tool Wear [min]")
plt.show()

##### 1. Why did you pick the specific chart?


A **violin plot** is a great way to visualize **how tool wear varies** across different **product types (L, M, H)**. It:  
- Shows the **distribution of tool wear** for each product type.  
- Helps identify **which product type experiences more wear and tear**.  
- Combines **box plot and density plot** to show **both range and frequency**.

##### 2. What is/are the insight(s) found from the chart?

**Insights from the Violin Plot (Tool Wear Distribution by Product Type)**  

1. **Tool Wear Varies by Product Type**  
   - If one product type (e.g., **L**) has **higher median tool wear**, it means **this type experiences more wear and tear** than others.  
   - If another type (e.g., **H**) has a **narrower and lower distribution**, it suggests **lower tool wear**, indicating better durability.  

2. **Some Machines Have Extremely High Tool Wear**  
   - If the violins have **long upper tails**, some machines experience **very high tool wear**.  
   - This could indicate **overuse, poor material quality, or operational inefficiencies**.  

3. **Tool Wear Patterns Are Not the Same for All Products**  
   - If different product types have **different tool wear distributions**, it suggests **design or operational differences** between the machines.
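
The shapes in the violin plot can be compared numerically with per-type quartiles. A minimal sketch: `wear_summary_by_type` is a hypothetical helper and the six demo rows are made up.

```python
import pandas as pd


def wear_summary_by_type(frame, group_col="Type", value_col="Tool wear [min]"):
    """Count, quartiles and maximum of tool wear for each product type."""
    grouped = frame.groupby(group_col, observed=True)[value_col]
    return pd.DataFrame({
        "count": grouped.size(),
        "q25": grouped.quantile(0.25),
        "median": grouped.median(),
        "q75": grouped.quantile(0.75),
        "max": grouped.max(),
    })


# Made-up rows, only to show the output
demo = pd.DataFrame({
    "Type": ["L", "L", "L", "M", "M", "H"],
    "Tool wear [min]": [20, 100, 180, 60, 140, 90],
})
summary = wear_summary_by_type(demo)
print(summary)

# In the notebook: wear_summary_by_type(df)
```

If the medians and quartiles are nearly identical across types, the violin shapes differ mainly because of sample size, not because of real wear differences.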

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Business Impact**  

✅ **Positive Impact:**  
- **Identifying high tool wear machines** allows businesses to schedule **preventive maintenance**, reducing unexpected failures.  
- **Optimizing machine usage** for high-wear product types can **extend tool life** and lower costs.  

❌ **Potential Risks:**  
- **Machines with excessive tool wear may fail sooner**, leading to **higher downtime and maintenance costs**.  
- If one product type consistently has **higher tool wear**, it may indicate **a design flaw** that could lead to **customer complaints and reduced sales**.  

#### Chart - 9

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Chart 9: Box Plot - Tool Wear vs. Tool Wear Failure (TWF)
plt.figure(figsize=(8, 5))
sns.boxplot(x=df["TWF"], y=df["Tool wear [min]"], palette="Set2")
plt.title("Box Plot of Tool Wear vs. Tool Wear Failure (TWF)", fontsize=14)
plt.xlabel("Tool Wear Failure (TWF) (0 = No Failure, 1 = Failure)")
plt.ylabel("Tool Wear [min]")
plt.show()

##### 1. Why did you pick the specific chart?


A box plot shows the distribution of tool wear for records with a tool wear failure (TWF = 1) vs. without one (TWF = 0).
If failed records have higher tool wear, it suggests that tool wear is a major driver of these failures.

##### 2. What is/are the insight(s) found from the chart?

**Insights from the Box Plot (Tool Wear vs. Machine Failure)**  

1. **Higher Tool Wear Increases Failure Risk**  
   - If failed machines (**TWF = 1**) have a **higher median tool wear**, it suggests that **machines with excessive tool wear are more likely to fail**.  
   - This means **tool wear is a major contributor to failures**, and preventive maintenance should focus on machines with high wear levels.  

2. **Some Failures Occur Even at Low Tool Wear**  
   - If some failed machines have **low tool wear**, it indicates that **other factors (e.g., temperature, speed, torque) also contribute to failures**.  
   - This suggests that **failures are multi-factorial**, not solely caused by tool wear.  

3. **Outliers Suggest Critical Failures**  
   - If we see **outliers (extreme tool wear values) among failed machines**, it means some machines were **pushed beyond safe wear limits**, leading to breakdowns.  
   - Such cases require **urgent monitoring and intervention** to prevent future failures.  
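
One way to test whether higher tool wear raises failure risk is to bin tool wear and compute the failure rate within each bin. The sketch below is illustrative: `failure_rate_by_bin` is a hypothetical helper, the bin edges (0, 100, 200, 300 minutes) are an assumption, and the eight demo rows are made up.

```python
import pandas as pd


def failure_rate_by_bin(frame, feature, target, bins):
    """Row count and failure rate (%) of `target` within bins of `feature`."""
    binned = pd.cut(frame[feature], bins=bins, include_lowest=True)
    # observed=False keeps empty bins in the result (their rate is NaN)
    grouped = frame.groupby(binned, observed=False)[target]
    return pd.DataFrame({
        "rows": grouped.size(),
        "failure_rate_pct": grouped.mean() * 100,
    })


# Made-up rows, only to show the output
demo = pd.DataFrame({
    "Tool wear [min]": [10, 60, 110, 160, 210, 220, 240, 250],
    "TWF": [0, 0, 0, 0, 0, 1, 1, 0],
})
rates = failure_rate_by_bin(demo, "Tool wear [min]", "TWF", bins=[0, 100, 200, 300])
print(rates)

# In the notebook:
# failure_rate_by_bin(df, "Tool wear [min]", "TWF", bins=[0, 100, 200, 300])
```

A rate that climbs from bin to bin would support setting a tool wear threshold for maintenance.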

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Business Impact**  

✅ **Positive Impact:**  
- Businesses can **set a tool wear threshold** for maintenance, preventing unexpected breakdowns.  
- Helps in **optimizing tool replacement schedules**, reducing downtime and costs.  

❌ **Potential Risks:**  
- If failures happen **even at low tool wear**, relying only on tool wear for maintenance may **miss other critical failure causes**.  
- Machines with **extreme tool wear failures** may require **design changes** to improve durability.  

### **Point to consider**  
Do failures occur only through tool wear (TWF), or should we also examine **the impact of other factors (e.g., temperature, speed, torque) on failures** to get a complete picture?

#### Chart - 10

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Chart 10: Box Plot - Air & Process Temperature vs. Failures
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

sns.boxplot(x=df["TWF"], y=df["Air temperature [K]"], palette="coolwarm", ax=axes[0])
axes[0].set_title("Air Temperature vs. Failures")
axes[0].set_xlabel("Tool Wear Failure (0 = No Failure, 1 = Failure)")
axes[0].set_ylabel("Air Temperature [K]")

sns.boxplot(x=df["TWF"], y=df["Process temperature [K]"], palette="coolwarm", ax=axes[1])
axes[1].set_title("Process Temperature vs. Failures")
axes[1].set_xlabel("Tool Wear Failure (0 = No Failure, 1 = Failure)")
axes[1].set_ylabel("Process Temperature [K]")

plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?


- A **box plot** helps visualize whether **machines that failed had higher or lower air/process temperatures**.  
- If **failed machines show a higher median temperature**, it suggests that **overheating contributes to failures**.  

##### 2. What is/are the insight(s) found from the chart?

**Expected Insights:**  
1. If **failed machines have a higher median temperature**, **overheating is a key failure cause**.  
2. If **temperatures are similar for failed and non-failed machines**, temperature might **not be a major failure factor**.  

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**✅ Positive Business Impact - Preventive Maintenance Optimization**  
1. **Early Warning System:** If failures occur at **high temperatures or extreme RPMs**, businesses can install **real-time monitoring systems** to alert operators before failures happen.  
2. **Smart Maintenance Scheduling:** Instead of **routine maintenance**, companies can **service machines only when operating near risky conditions**, **saving costs** and improving **efficiency**.  
3. **Longer Machine Lifespan:** Preventing **excessive wear and tear** by controlling **speed and torque fluctuations** helps machines last **longer**, reducing replacement costs.  

**❌ Negative Business Impact - Reduced Production Speed**  
1. **If machines fail at high speeds, companies may need to lower RPM limits**, which could **slow down production** and **reduce output per hour**.  
2. **If overheating is an issue, additional cooling systems may be required**, increasing **electricity costs**.  
3. **Machines may need to be shut down more often for preventive maintenance**, reducing **overall efficiency**.

#### Chart - 11

In [None]:
# Chart 11: KDE Plot - Rotational Speed vs. Failures
plt.figure(figsize=(8, 5))
# `fill=True` replaces the deprecated `shade=True` argument (seaborn >= 0.11)
sns.kdeplot(df[df["TWF"] == 0]["Rotational speed [rpm]"], label="No Failure (0)", fill=True, color="green")
sns.kdeplot(df[df["TWF"] == 1]["Rotational speed [rpm]"], label="Failure (1)", fill=True, color="red")
plt.title("KDE Plot of Rotational Speed vs. Failures", fontsize=14)
plt.xlabel("Rotational Speed [RPM]")
plt.ylabel("Density")
plt.legend()
plt.show()


##### 1. Why did you pick the specific chart?

- A **Kernel Density Estimation (KDE) plot** shows whether **machines that failed operated at extreme speeds**.  
- If failed machines cluster at **very high or very low speeds**, it suggests that **running too fast or too slow contributes to failures**.  


##### 2. What is/are the insight(s) found from the chart?


1. If failed machines have **higher density at extreme RPMs**, **speed fluctuations are a major risk factor**.  
2. If both failure and non-failure machines have **similar speed distributions**, speed might **not significantly impact failures**.
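
One way to put a number on "extreme RPMs" is to define a normal band from the non-failed machines and measure how many failed machines fall outside it. This is only a sketch: the 5th to 95th percentile band is an assumed cut-off, and the `toy` values are invented for illustration.

```python
import pandas as pd

# Hypothetical toy data; in the notebook, use the full `df` instead.
toy = pd.DataFrame({
    "TWF": [0] * 6 + [1] * 4,
    "Rotational speed [rpm]": [1400, 1450, 1500, 1550, 1600, 1650,
                               1200, 1500, 2400, 2500],
})

speed = "Rotational speed [rpm]"

# Assumed "normal" band: 5th-95th percentile of non-failed speeds.
normal = toy.loc[toy["TWF"] == 0, speed]
low, high = normal.quantile(0.05), normal.quantile(0.95)

# Share of failed machines running outside that band.
failed = toy.loc[toy["TWF"] == 1, speed]
share_extreme = ((failed < low) | (failed > high)).mean()

print(f"Normal band: {low:.1f} to {high:.1f} rpm")
print(f"Share of failures outside the band: {share_extreme:.0%}")
```

By construction roughly 10% of non-failed machines sit outside the band, so a share well above 10% among failures would support the first insight, and a share near 10% would support the second.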

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**✅ Positive Business Impact : Energy Efficiency & Cost Savings**  
1. **Reducing Overheating Risks:** If machines fail at **higher temperatures**, optimizing **cooling systems** or using **temperature-controlled environments** can **prevent failures and save energy**.  
2. **Balancing Speed & Energy Use:** If failures happen at **very high RPMs**, machines can be **run at optimal speeds** that balance **energy efficiency** and **low failure rates**.  
3. **Lower Downtime, Higher Productivity:** **Unplanned downtime** due to machine failures leads to **production delays**. Reducing failures ensures **continuous operation**, improving **output and revenue**.

**❌ Negative Business Impact - Increased Maintenance & Operational Costs**  
1. **If maintenance is scheduled too frequently based on speed/temperature thresholds, some machines may receive unnecessary servicing**, increasing **labor and material costs**.  
2. **Investing in new cooling solutions or reinforced machine parts could require significant capital investment**, affecting the company’s budget.  
3. **If torque fluctuations lead to failures, redesigning parts may take months, delaying production improvements**.

#### Chart - 12

In [None]:
# Chart 12: Scatter Plot - Torque vs. Failures
plt.figure(figsize=(8, 5))
sns.scatterplot(x=df["Torque [Nm]"], y=df["TWF"], alpha=0.5, color="purple")
plt.title("Scatter Plot of Torque vs. Machine Failures", fontsize=14)
plt.xlabel("Torque [Nm]")
plt.ylabel("Tool Wear Failure (0 = No Failure, 1 = Failure)")
plt.show()


##### 1. Why did you pick the specific chart?


- A **scatter plot** will show if **machines that failed were operating at higher torque values**.  
- If failures cluster at **high torque values**, it means **mechanical stress is a major failure cause**.

##### 2. What is/are the insight(s) found from the chart?

**Expected Insights:**  
1. If most failures happen at **higher torque values**, **machines experiencing excessive stress are at risk**.  
2. If failures occur at **random torque values**, torque might **not be a primary failure factor**.
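
Because TWF is binary, the scatter plot collapses into two horizontal bands that can be hard to read. A binned failure rate is a possible complement. The sketch below uses invented `toy` data and arbitrary 20 Nm bins, both of which are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data; in the notebook, use the full `df` instead.
toy = pd.DataFrame({
    "Torque [Nm]": [10, 15, 25, 30, 35, 45, 50, 55, 65, 70],
    "TWF":         [0,  0,  0,  0,  0,  0,  1,  0,  1,  1],
})

# Assumed bin edges; intervals are right-closed, e.g. (0, 20], (20, 40], ...
bins = [0, 20, 40, 60, 80]
toy["torque_bin"] = pd.cut(toy["Torque [Nm]"], bins=bins)

# Failure rate and row count per torque bin.
rate_by_bin = toy.groupby("torque_bin", observed=False)["TWF"].agg(
    failure_rate="mean", n="count"
)

print(rate_by_bin)
```

A failure rate that climbs from bin to bin would match the first insight, while a flat rate would match the second. The `n` column matters because failures are rare, so rates in sparsely populated bins are unreliable.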

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**✅ Positive Business Impact - Improving Machine Design & Operations**  
1. **Redesigning High-Failure Components:** If failures happen **due to high torque**, machines can be **redesigned with stronger parts** to handle the load better.  
2. **Training Operators for Safe Usage:** If failures are caused by **operator handling**, training staff to operate within **safe temperature and speed limits** can **reduce damage and breakdowns**.  
3. **Using AI for Predictive Maintenance:** Data from temperature, speed, and torque can be fed into **AI-based predictive models** to **accurately forecast failures**, allowing **better resource planning**.

**❌ Negative Business Impact - Unpredictable Failure Risks**  
1. **If no clear relationship is found between failures and temperature, speed, or torque, failures might be random**, making prediction difficult.  
2. **If failures occur at normal operating levels, businesses may have to consider expensive redesigns**, which could slow down expansion plans.  
3. **If failures are multi-factorial (influenced by multiple conditions together), businesses may struggle to implement simple preventive strategies**, leading to **continued breakdowns**.

#### Chart - 13

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

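# Chart 13: Pair Plot - Multi-Feature Relationships
# A fixed random_state keeps the 500-row sample (and the plot) reproducible across runs
sns.pairplot(df.sample(500, random_state=42), hue="TWF", diag_kind="hist", palette="coolwarm")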
plt.show()


##### 1. Why did you pick the specific chart?

- A **pair plot** helps analyze multiple numerical variables together.  
- It shows **how different features interact** and whether any patterns exist in failed vs. non-failed machines.

##### 2. What is/are the insight(s) found from the chart?

1. If failures cluster in **specific regions**, those features likely **contribute to failures**.  
2. If failures are **randomly scattered**, failure causes may be **multi-factorial or unstructured**.
3. The plot also helps **identify outliers** and **feature relationships**.
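
The pair plot can be complemented with a quick ranking of how strongly each numeric feature moves with the failure flag. The sketch below uses Pearson correlation against the binary `TWF` column (equivalent to a point-biserial correlation). The `toy` data is invented for illustration.

```python
import pandas as pd

# Hypothetical toy data; in the notebook, use the numeric columns of `df`.
toy = pd.DataFrame({
    "Tool wear [min]": [10, 50, 100, 150, 200, 220],
    "Torque [Nm]":     [40, 41, 39, 40, 41, 39],
    "TWF":             [0, 0, 0, 0, 1, 1],
})

# Pearson correlation of every feature with the failure flag,
# sorted from the strongest positive to the strongest negative.
corr_with_twf = toy.corr()["TWF"].drop("TWF").sort_values(ascending=False)

print(corr_with_twf)
```

Correlation only captures linear, one-feature-at-a-time relationships, so a weak value does not rule out the multi-factorial causes mentioned above.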

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

✅ **Positive Impact:**  
- Helps identify **strong relationships** between temperature, speed, torque, and failures.  
- If failures **cluster around specific values**, businesses can set **operating limits** to **prevent breakdowns**.  
- Reduces **unexpected downtime** and increases **machine lifespan** by running machines under **safe conditions**.  

❌ **Negative Impact:**  
- If failures occur **randomly** without strong correlations, it suggests **multi-factorial issues**, making it **harder to predict failures**.  
- Businesses may need to **collect more sensor data** or **invest in better failure detection systems**, increasing costs.

#### Chart - 14

In [None]:
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

# Chart 14: 3D Scatter Plot - Speed, Torque & Failures
fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111, projection='3d')

ax.scatter(df["Rotational speed [rpm]"], df["Torque [Nm]"], df["TWF"],
           c=df["TWF"], cmap="coolwarm", alpha=0.6)

ax.set_title("3D Scatter Plot: Speed vs. Torque vs. Failures")
ax.set_xlabel("Rotational Speed [RPM]")
ax.set_ylabel("Torque [Nm]")
ax.set_zlabel("Tool Wear Failure (0 = No Failure, 1 = Failure)")

plt.show()


##### 1. Why did you pick the specific chart?

- Helps visualize the **relationship between rotational speed, torque, and failures** in **3D space**.  
- If failed machines cluster in **specific speed-torque combinations**, it confirms these as **risk factors**.

##### 2. What is/are the insight(s) found from the chart?

1. If failures are **grouped at high speeds and torques**, these factors strongly **affect failures**.  
2. If failures are **scattered randomly**, failures may have **other causes (temperature, wear, etc.)**.
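
A 3D scatter of a binary outcome can be hard to read, so a small speed-by-torque grid of failure rates is one possible cross-check. The band edges (1600 rpm, 45 Nm) and the `toy` data below are assumptions chosen only to illustrate the idea.

```python
import pandas as pd

# Hypothetical toy data; in the notebook, use the full `df` instead.
toy = pd.DataFrame({
    "Rotational speed [rpm]": [1300, 1350, 1400, 1800, 1850, 1900, 1320, 1880],
    "Torque [Nm]":            [60,   65,   30,   25,   30,   62,   28,   66],
    "TWF":                    [1,    0,    0,    0,    0,    1,    0,    1],
})

# Assumed band edges for illustration only.
speed_band = pd.cut(toy["Rotational speed [rpm]"], bins=[1000, 1600, 2200],
                    labels=["low rpm", "high rpm"])
torque_band = pd.cut(toy["Torque [Nm]"], bins=[0, 45, 90],
                     labels=["low torque", "high torque"])

# Mean TWF (failure rate) for every speed/torque combination.
grid = (
    toy.groupby([speed_band, torque_band], observed=True)["TWF"]
    .mean()
    .unstack()
)

print(grid)
```

If one cell of the grid stands out, that speed-torque combination is a candidate risk region. If all cells look alike, the second insight (other causes) is more plausible.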

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

✅ **Positive Impact:**  
- If failures cluster in **high torque & speed regions**, businesses can **limit speed variations** and **optimize torque loads**.  
- Can be used to **train AI models** for **failure prediction**, improving preventive maintenance strategies.  
- Helps **reduce mechanical stress**, preventing **excessive wear and costly repairs**.  

❌ **Negative Impact:**  
- If reducing speed to prevent failures **slows down production**, it could lower **overall output and revenue**.  
- If businesses **overreact to the data**, they might **replace parts too frequently**, increasing **maintenance costs** unnecessarily.

#### Chart - 15


In [None]:
# Chart 15: Heatmap - Failure Types vs. Product Type
plt.figure(figsize=(8, 5))
failure_types = ["TWF", "HDF", "PWF", "OSF", "RNF"]
failure_counts = df.groupby("Type")[failure_types].sum()

sns.heatmap(failure_counts, annot=True, cmap="Reds", fmt="d")
plt.title("Heatmap of Failure Types by Product Type")
plt.xlabel("Failure Type")
plt.ylabel("Product Type")
plt.show()


##### 1. Why did you pick the specific chart?

- Helps analyze which **failure types (TWF, HDF, PWF, OSF, RNF)** are most common for each **product type (L, M, H)**.


##### 2. What is/are the insight(s) found from the chart?

1. If a **specific product type has more failures**, it may have a **design flaw or be used in harsher conditions**.  
2. If **certain failure types occur more often**, maintenance teams can **focus on preventing those failures first**.
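
Raw counts favor whichever product type has the most rows, so it may help to compare failure rates alongside the counts shown in the heatmap. The sketch below uses an invented `toy` DataFrame with only two of the failure columns. The notebook's `df` and `failure_types` list can be substituted directly.

```python
import pandas as pd

# Hypothetical toy data; in the notebook, use the full `df` instead.
toy = pd.DataFrame({
    "Type": ["L"] * 6 + ["M"] * 3 + ["H"] * 1,
    "TWF":  [1, 0, 0, 0, 0, 0,  1, 0, 0,  0],
    "HDF":  [1, 1, 0, 0, 0, 0,  0, 0, 0,  1],
})
failure_types = ["TWF", "HDF"]

# Counts: how many failures of each kind per product type (what Chart 15 shows).
counts = toy.groupby("Type")[failure_types].sum()

# Rates: share of rows of each product type that had the failure.
rates = toy.groupby("Type")[failure_types].mean()

print(counts)
print(rates.round(2))
```

In this toy example type L has the most HDF failures by count, yet type H has the highest HDF rate, which is the kind of difference a count-only heatmap can hide.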

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

✅ **Positive Impact:**  
- Helps identify **which failure types occur most in specific products**, allowing **targeted design improvements**.  
- If one product type has **frequent tool wear failures**, businesses can **upgrade materials** to **increase durability**.  
- Prevents **customer complaints** and **reduces warranty claims**, improving **brand reputation**.

❌ **Negative Impact:**  
- If redesigning the most failure-prone product **is too expensive**, businesses may struggle with **budget constraints**.  
- If failures are linked to **external factors like workload or usage conditions**, redesigning products **may not fully solve the problem**.

## **5. Solution to Business Objective**

#### What do you suggest the client do to achieve the Business Objective?
Explain Briefly.

**Business Recommendations to Achieve the Objective**  

Based on our **Exploratory Data Analysis (EDA)** and insights from **temperature, speed, torque, tool wear, and failures**, I recommend the following actions to achieve **better machine reliability and failure reduction**:  

---

**1️⃣ Implement Predictive Maintenance**  
✅ **Why?**  
- Machine failures may be linked to **high temperature, excessive speed, and tool wear**; the charts above test each of these relationships.  
- A **real-time monitoring system** can alert when machines reach **dangerous operating conditions**.  

✅ **How?**  
- Use **IoT sensors** to track **temperature, speed, and torque in real time**.  
- Set **thresholds** where machines automatically **adjust speed** or **alert for maintenance** before failures happen.  
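
The threshold idea above can be sketched in a few lines. The limits in `THRESHOLDS`, the helper name `flag_readings`, and the sample readings are all hypothetical. Real limits would have to come from the failure analysis or the machine specifications.

```python
import pandas as pd

# Hypothetical alert limits, for illustration only.
THRESHOLDS = {
    "Process temperature [K]": 312.0,
    "Rotational speed [rpm]": 2500,
    "Torque [Nm]": 65.0,
}

def flag_readings(readings: pd.DataFrame, thresholds: dict) -> pd.DataFrame:
    """Return a boolean DataFrame: True where a reading exceeds its limit."""
    limits = pd.Series(thresholds)
    # Compare each monitored column against its own limit.
    return readings[limits.index].gt(limits, axis="columns")

# Invented sensor readings, one row per measurement.
readings = pd.DataFrame({
    "Process temperature [K]": [309.5, 313.1, 310.0],
    "Rotational speed [rpm]":  [1500, 1450, 2700],
    "Torque [Nm]":             [40.0, 42.0, 70.0],
})

flags = flag_readings(readings, THRESHOLDS)

# Rows where at least one limit is exceeded would trigger an alert.
alerts = readings[flags.any(axis=1)]

print(flags)
print(alerts)
```

Fixed limits are the simplest option. A model-based risk score could replace them later without changing the alerting step.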

---

**2️⃣ Optimize Operating Conditions**  
✅ **Why?**  
- If failures increase at **higher RPM or torque**, machines may be **operating beyond safe limits**.  
- Adjusting **speed and load settings** can **reduce stress on machine parts**.  

✅ **How?**  
- Define an **optimal RPM range** that minimizes failures while maintaining efficiency.  
- Use **AI-based automation** to **dynamically adjust speed and torque** based on load conditions.  

---

**3️⃣ Improve Product Design & Material Selection**  
✅ **Why?**  
- If **one product type (L, M, H) fails more often**, it may have a **design flaw or weaker material**.  
- Enhancing **tool durability** can **extend machine lifespan**.  

✅ **How?**  
- Analyze **failure heatmaps** to determine which product types **fail most often**.  
- Upgrade **materials or cooling systems** to handle high temperatures and stress better.  

---

**4️⃣ Reduce Unnecessary Maintenance Costs**  
✅ **Why?**  
- Over-maintenance can **increase costs without reducing failures**.  
- Maintenance should be **targeted based on real failure risks**.  

✅ **How?**  
- Use **AI-driven failure prediction models** to **schedule maintenance only when needed**.  
- Avoid replacing parts too early by analyzing **real wear rates** from historical data.  

---

**5️⃣ Train Operators for Better Machine Handling**  
✅ **Why?**  
- If failures are linked to **speed or torque spikes**, it could be due to **improper machine operation**.  
- Educating operators can **reduce stress on machines**.  

✅ **How?**  
- Implement **operator training programs** on **optimal machine handling**.  
- Use **dashboards** to provide live feedback on **best operating conditions**.

# **Conclusion**

**Final Recommendation**  
By combining **real-time monitoring, AI-driven predictive maintenance, optimized operations, and better product design**, businesses can:  
✔ **Reduce unexpected failures**  
✔ **Lower maintenance costs**  
✔ **Improve production efficiency**  
✔ **Extend machine lifespan**

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***