# Practical Exam: Spectrum Shades LLC
Spectrum Shades LLC is a prominent supplier of concrete color solutions, offering a wide range of pigments and coloring systems used in various concrete applications, including decorative concrete, precast concrete, and concrete pavers. The company prides itself on delivering high-quality colorants that meet the unique needs of its diverse clientele, including contractors, architects, and construction companies.
<br><br>
The company has recently observed a growing number of customer complaints regarding inconsistent color quality in their products. The discrepancies have led to a decline in customer satisfaction and a potential increase in product returns.
By identifying and mitigating the factors causing color variations, the company can enhance product reliability, reduce customer complaints, and minimize return rates.
<br><br>
You are part of the data analysis team tasked with providing actionable insights to help Spectrum Shades LLC address the issues of inconsistent color quality and improve customer satisfaction.

# Task 1

Before you can start any analysis, you need to confirm that the data is accurate and reflects what you expect to see. 

It is known that there are some issues with the `production_data` table, and the data team have provided the following data description. 

Write a query to ensure the data matches the description provided, including identifying and cleaning all invalid values. You must match all column names and description criteria.
<br>

- You should start with the data in the file "production_data.csv".
- Your output should be a DataFrame named clean_data.
- All column names and values should match the table below.
<br>

| Column Name             | Criteria                                                                                         |
|--------------------------|--------------------------------------------------------------------------------------------------|
| batch_id | Discrete. Identifier for each batch. Missing values are not possible. |
| production_date | Date. Date when the batch was produced.|
| raw_material_supplier | Categorical. Supplier of the raw materials. (1='national_supplier', 2='international_supplier'). <br> Missing values should be replaced with 'national_supplier'.|
| pigment_type           | Nominal. Type of pigment used. ['type_a', 'type_b', 'type_c']. <br> Missing values should be replaced with 'other'. |
| pigment_quantity       | Continuous. Amount of pigment added (in kilograms) (Range: 1 - 100). <br> Missing values should be replaced with median. |
| mixing_time           | Continuous. Duration of the mixing process (in minutes). <br> Missing values should be replaced with mean, rounded to 2 decimal places. |
| mixing_speed          | Categorical. Speed of the mixing process represented as categories: 'Low', 'Medium', 'High'. <br> Missing values should be replaced with 'Not Specified'. |
| product_quality_score | Continuous. Overall quality score of the final product (rating on a scale of 1 to 10). <br> Missing values should be replaced with mean, rounded to 2 decimal places. |
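
The criteria in the table can also be checked programmatically once the data has been cleaned. The following is a minimal sketch, assuming a hypothetical helper `find_violations` and a small made-up `toy` DataFrame (neither is part of the exam); the allowed values and ranges are copied from the table above.

```python
import pandas as pd

# Allowed values per categorical column, taken from the description table.
ALLOWED = {
    "raw_material_supplier": {"national_supplier", "international_supplier"},
    "pigment_type": {"type_a", "type_b", "type_c", "other"},
    "mixing_speed": {"Low", "Medium", "High", "Not Specified"},
}

def find_violations(df: pd.DataFrame) -> dict:
    """Hypothetical helper: return {column: count of values breaking the description}."""
    problems = {}
    for col, allowed in ALLOWED.items():
        # NaN is not in the allowed set, so missing values count as violations too
        problems[col] = int((~df[col].isin(allowed)).sum())
    # between() is inclusive on both ends by default; NaN evaluates to False
    problems["pigment_quantity"] = int((~df["pigment_quantity"].between(1, 100)).sum())
    problems["product_quality_score"] = int((~df["product_quality_score"].between(1, 10)).sum())
    for col in ["batch_id", "mixing_time"]:
        problems[col] = int(df[col].isna().sum())
    # Report only the columns that actually have problems
    return {col: n for col, n in problems.items() if n > 0}

# Toy rows, made up for illustration only
toy = pd.DataFrame({
    "batch_id": [1, 2, 3],
    "raw_material_supplier": ["national_supplier", "international_supplier", "national_supplier"],
    "pigment_type": ["type_a", "Type_C", "other"],
    "pigment_quantity": [10.0, 150.0, 35.5],
    "mixing_time": [30.0, None, 25.5],
    "mixing_speed": ["Low", "-", "High"],
    "product_quality_score": [7.5, 6.0, 9.1],
})
violations = find_violations(toy)
print(violations)
```

Running the same helper on `clean_data` should return an empty dictionary if every column matches the description.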


In [39]:
# Write your answer to Task 1 here
import numpy as np
import pandas as pd
from datetime import datetime
import re
import os
from IPython.display import FileLink

#reading data
df1=pd.read_csv("production_data.csv")
column_list = list(df1.columns)

#Batch_ID
# Ensure batch_id is a discrete identifier: cast to string and strip surrounding whitespace
df1['batch_id'] = df1['batch_id'].astype("str").str.strip()
#print(df1.dtypes)

# Ensure it functions as a proper identifier
# Remove any duplicate batch_ids if they exist (keep first occurrence)
df1 = df1.drop_duplicates(subset=['batch_id'], keep='first')

#production_date
df1['production_date'] = pd.to_datetime(df1['production_date'], errors='coerce')
#df1['production_date']= df1['production_date'].dt.date
#print(df1)
#print(df1.dtypes)

#raw_material_supplier
def process_supplier_category(value):
    if pd.isna(value) or value is None:
        return 'national_supplier'
    elif value == 1:
        return 'national_supplier'
    elif value == 2:
        return 'international_supplier'
    else:
        return 'national_supplier'  # Default for unexpected values

df1['raw_material_supplier']=df1['raw_material_supplier'].apply(process_supplier_category)
df1["raw_material_supplier"] = df1["raw_material_supplier"].astype("category")
#print(df1.dtypes)
#print(df1['raw_material_supplier'])

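#pigment_type
# Normalize case and whitespace (e.g. "Type_C" -> "type_c")
df1["pigment_type"] = df1["pigment_type"].str.strip().str.lower()
# The description requires missing values to be replaced with lowercase 'other'
df1['pigment_type']=df1['pigment_type'].fillna("other")
df1['pigment_type']=df1['pigment_type'].astype("category")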
#print(df1.dtypes)
#print(df1['pigment_type'].unique())

#pigment_quantity
df1['pigment_quantity']=df1['pigment_quantity'].fillna(df1['pigment_quantity'].median()).round(2)

#print(df1['pigment_quantity'])

#mixing_time
df1['mixing_time']=df1['mixing_time'].fillna(df1['mixing_time'].mean().round(2))
#print(df1.dtypes)
#print(df1['mixing_time'])

#mixing_speed
df1['mixing_speed']=df1['mixing_speed'].fillna("Not Specified")
df1['mixing_speed']=df1['mixing_speed'].replace("-","Not Specified")
df1['mixing_speed']=df1['mixing_speed'].astype("category")
#print(df1.dtypes)
#print(df1['mixing_speed'].unique())

#product_quality_score	
df1['product_quality_score']=df1['product_quality_score'].fillna(df1['product_quality_score'].mean())
df1['product_quality_score']=df1['product_quality_score'].round(2)
#print(df1['product_quality_score'])

df1.describe()
print(df1.dtypes)

clean_data=df1


batch_id                         object
production_date          datetime64[ns]
raw_material_supplier          category
pigment_type                   category
pigment_quantity                float64
mixing_time                     float64
mixing_speed                   category
product_quality_score           float64
dtype: object


# Task 2

You want to understand how the supplier type and quantity of materials affect the final product attributes.

Calculate the average `product_quality_score` and `pigment_quantity` grouped by `raw_material_supplier`.

- You should start with the data in the file 'production_data.csv'. 
- Your output should be a DataFrame named aggregated_data.
- It should include the three columns: `raw_material_supplier`, `avg_product_quality_score`, and `avg_pigment_quantity`.
- Your answers should be rounded to 2 decimal places.
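
One way to produce the required column names directly is pandas named aggregation. This is a minimal sketch on a made-up `toy` DataFrame; the variable names `toy` and `summary` are assumptions for illustration and are not part of the exam.

```python
import pandas as pd

# Toy rows, made up for illustration only
toy = pd.DataFrame({
    "raw_material_supplier": [1, 1, 2, 2],
    "product_quality_score": [8.0, 9.0, 6.0, 5.0],
    "pigment_quantity": [40.0, 50.0, 30.0, 35.0],
})

# Named aggregation: output_column=(input_column, aggregation function)
summary = (
    toy.groupby("raw_material_supplier", as_index=False)
       .agg(
           avg_product_quality_score=("product_quality_score", "mean"),
           avg_pigment_quantity=("pigment_quantity", "mean"),
       )
       .round(2)
)
print(summary)
```

With named aggregation the output columns already carry their final names, so no separate renaming step is needed.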


In [40]:
# Write your answer to Task 2 here
import numpy as np
import pandas as pd
from datetime import datetime
import re
import os
from IPython.display import FileLink

#reading data
df2=pd.read_csv("production_data.csv")
column_list = list(df2.columns)
Required_columns=clean_data.columns
print(Required_columns)

aggregated_data = df2.groupby('raw_material_supplier').agg({
    'product_quality_score': 'mean',
    'pigment_quantity': 'mean'
}).round(2).reset_index()

# Rename columns for clarity
aggregated_data.columns = ['raw_material_supplier', 'avg_product_quality_score', 'avg_pigment_quantity']
print(aggregated_data)

Index(['batch_id', 'production_date', 'raw_material_supplier', 'pigment_type',
       'pigment_quantity', 'mixing_time', 'mixing_speed',
       'product_quality_score'],
      dtype='object')
   raw_material_supplier  avg_product_quality_score  avg_pigment_quantity
0                      1                       8.02                 44.73
1                      2                       5.97                 34.91


# Task 3

To get more insight into the factors behind product quality, you want to filter the data to see an average product quality score for a specified set of results.

Identify the average `product_quality_score` for batches with a `raw_material_supplier` of 2 and a `pigment_quantity` greater than 35 kg.

Write a query to return the average `product_quality_score`, reported as `avg_product_quality_score`, for these filtered batches. Use the original production data table, not the output of Task 2.

- You should start with the data in the file 'production_data.csv'. 
- Your output should be a DataFrame named pigment_data.
- It should consist of a 1-row DataFrame with 3 columns: `raw_material_supplier`, `pigment_quantity`, and `avg_product_quality_score`.
- Your answers should be rounded to 2 decimal places where appropriate.
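
The two filter conditions can be combined into a single boolean mask. This is a minimal sketch on a made-up `toy` DataFrame (the names `toy`, `mask`, `filtered`, and `avg_score` are assumptions for illustration).

```python
import pandas as pd

# Toy rows, made up for illustration only
toy = pd.DataFrame({
    "raw_material_supplier": [1, 2, 2, 2],
    "pigment_quantity": [50.0, 30.0, 40.0, 60.0],
    "product_quality_score": [9.0, 4.0, 6.0, 7.0],
})

# Each condition needs its own parentheses because & binds tighter than == and >
mask = (toy["raw_material_supplier"] == 2) & (toy["pigment_quantity"] > 35)
filtered = toy[mask]

avg_score = round(filtered["product_quality_score"].mean(), 2)
print(len(filtered), avg_score)
```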


In [41]:
# Write your answer to Task 3 here
#reading data
df3=pd.read_csv("production_data.csv")
column_list = list(df3.columns)
print(df3.dtypes)
#Filter - raw_material_supplier of 2 and a pigment_quantity greater than 35 kg.
df3=df3[(df3['raw_material_supplier'] == 2) & (df3['pigment_quantity'] > 35)]

#1-row DataFrame with 3 columns: raw_material_supplier, pigment_quantity, and avg_product_quality_score.
avg_product_quality_score=df3['product_quality_score'].mean()

avg_pigment_quantity=df3['pigment_quantity'].mean()

pigment_data = pd.DataFrame({'raw_material_supplier': [2],'pigment_quantity': [round(avg_pigment_quantity, 2)],'avg_product_quality_score': [round(avg_product_quality_score, 2)]})

pigment_data.reset_index(drop=True, inplace=True)


print(pigment_data)


batch_id                   int64
production_date           object
raw_material_supplier      int64
pigment_type              object
pigment_quantity         float64
mixing_time              float64
mixing_speed              object
product_quality_score    float64
dtype: object
   raw_material_supplier  pigment_quantity  avg_product_quality_score
0                      2             39.01                       5.97


# Task 4

In order to proceed with further analysis later, you need to analyze how various factors relate to product quality. Start by calculating the mean and standard deviation for the following columns: `pigment_quantity` and `product_quality_score`. <br> These statistics will help in understanding the central tendency and variability of the data related to product quality.
<br><br>
Next, calculate the Pearson correlation coefficient between `pigment_quantity` and `product_quality_score`.
<br>
This correlation coefficient will provide insight into the strength and direction of the relationship between pigment quantity and overall product quality.


- You should start with the data in the file 'production_data.csv'.
- Calculate the mean and standard deviation for the columns pigment_quantity and product_quality_score as: `product_quality_score_mean`, `product_quality_score_sd`, `pigment_quantity_mean`, `pigment_quantity_sd`.
- Calculate the Pearson correlation coefficient between pigment_quantity and product_quality_score as: `corr_coef`
- Your output should be a DataFrame named product_quality.
- It should include the columns: `product_quality_score_mean`, `product_quality_score_sd`, `pigment_quantity_mean`, `pigment_quantity_sd`, `corr_coef`.
- Ensure that your answers are rounded to 2 decimal places.
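
To make the requested statistics concrete, the sketch below computes them by hand with the Python standard library on made-up values (the lists `pigment` and `quality` are assumptions for illustration). It uses the sample standard deviation (divisor n - 1), which is also the default of pandas `Series.std()`, and the Pearson coefficient as the sample covariance divided by the product of the two standard deviations.

```python
import statistics

# Toy values, made up for illustration only
pigment = [10.0, 20.0, 30.0, 40.0]
quality = [2.0, 4.0, 5.0, 9.0]

mean_x = statistics.mean(pigment)
mean_y = statistics.mean(quality)
sd_x = statistics.stdev(pigment)  # sample standard deviation (n - 1)
sd_y = statistics.stdev(quality)

# Sample covariance, using the same n - 1 divisor as the standard deviations
n = len(pigment)
cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(pigment, quality)) / (n - 1)

# Pearson correlation coefficient
r = cov_xy / (sd_x * sd_y)
print(round(mean_x, 2), round(sd_x, 2), round(mean_y, 2), round(sd_y, 2), round(r, 2))
```

In pandas, `Series.corr` uses the Pearson method by default, so it should agree with this hand calculation on the same data.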


In [42]:
# Write your answer to Task 4 here
df4=pd.read_csv("production_data.csv")
column_list = list(df4.columns)
print(df4.dtypes)

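# Median is taken from the raw column, before out-of-range values are blanked
median_pigment = df4['pigment_quantity'].median()

# Treat values outside the documented 1-100 kg range as missing
df4['pigment_quantity'] = df4['pigment_quantity'].where(df4['pigment_quantity'].between(1, 100))

# Assign the result back instead of calling fillna(inplace=True) on a column selection
df4['pigment_quantity'] = df4['pigment_quantity'].fillna(median_pigment)

mean_quality = round(df4['product_quality_score'].mean(), 2)

df4['product_quality_score'] = df4['product_quality_score'].fillna(mean_quality)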

pigment_quantity_mean = round(df4['pigment_quantity'].mean(), 2)

pigment_quantity_sd = round(df4['pigment_quantity'].std(), 2)

product_quality_score_mean = round(df4['product_quality_score'].mean(), 2)

product_quality_score_sd = round(df4['product_quality_score'].std(), 2)

corr_coef = round(df4['pigment_quantity'].corr(df4['product_quality_score']), 2)

product_quality = pd.DataFrame([{
'product_quality_score_mean': product_quality_score_mean,
'product_quality_score_sd': product_quality_score_sd,
'pigment_quantity_mean': pigment_quantity_mean,
'pigment_quantity_sd': pigment_quantity_sd,
'corr_coef': corr_coef
}])

print(product_quality)
print(product_quality.dtypes)

batch_id                   int64
production_date           object
raw_material_supplier      int64
pigment_type              object
pigment_quantity         float64
mixing_time              float64
mixing_speed              object
product_quality_score    float64
dtype: object
   product_quality_score_mean  ...  corr_coef
0                        6.68  ...       0.49

[1 rows x 5 columns]
product_quality_score_mean    float64
product_quality_score_sd      float64
pigment_quantity_mean         float64
pigment_quantity_sd           float64
corr_coef                     float64
dtype: object


# FORMATTING AND NAMING CHECK
Use the code block below to check that your outputs are correctly named and formatted before you submit your project.

This code checks whether you have met our automarking requirements: that the specified DataFrames exist and contain the required columns. It then prints a table showing ✅ for each column that exists and ❌ for any that are missing, or if the DataFrame itself isn't available.

If a DataFrame or a column in a DataFrame doesn't exist, carefully check your code again.

IMPORTANT: even if your code passes the check below, this does not mean that your entire submission is correct. This is a check for naming and formatting only.

In [43]:
import pandas as pd

def check_columns(output_df, output_df_name, required_columns):
    results = []
    for col in required_columns:
        exists = col in output_df.columns
        results.append({'Dataset': output_df_name, 'Column': col, 'Exists': '✅' if exists else '❌'})
    return results

def safe_check(output_df_name, required_columns):
    results = []
    if output_df_name in globals():
        obj = globals()[output_df_name]
        if isinstance(obj, pd.DataFrame):
            results.extend(check_columns(obj, output_df_name, required_columns))
        elif isinstance(obj, str) and ("SELECT" in obj.upper() or "FROM" in obj.upper()):
            results.append({'Dataset': output_df_name, 'Column': '—', 'Exists': 'ℹ️ SQL query string'})
        else:
            results.append({'Dataset': output_df_name, 'Column': '—', 'Exists': '❌ Not a DataFrame or query'})
    else:
        results.append({'Dataset': output_df_name, 'Column': '—', 'Exists': '❌ Variable not defined'})
    return results

requirements = {
    'clean_data': ['production_date', 'pigment_type', 'mixing_time', 'mixing_speed'],
    'aggregated_data': ['raw_material_supplier', 'avg_product_quality_score', 'avg_pigment_quantity'],
    'pigment_data': ['raw_material_supplier', 'pigment_quantity', 'avg_product_quality_score'],
    'product_quality': ['product_quality_score_mean', 'product_quality_score_sd',
                        'pigment_quantity_mean', 'pigment_quantity_sd', 'corr_coef']
}

all_results = []
for output_df_name, cols in requirements.items():
    all_results += safe_check(output_df_name, cols)

check_results_df = pd.DataFrame(all_results)

print(check_results_df)

            Dataset                      Column Exists
0        clean_data             production_date      ✅
1        clean_data                pigment_type      ✅
2        clean_data                 mixing_time      ✅
3        clean_data                mixing_speed      ✅
4   aggregated_data       raw_material_supplier      ✅
5   aggregated_data   avg_product_quality_score      ✅
6   aggregated_data        avg_pigment_quantity      ✅
7      pigment_data       raw_material_supplier      ✅
8      pigment_data            pigment_quantity      ✅
9      pigment_data   avg_product_quality_score      ✅
10  product_quality  product_quality_score_mean      ✅
11  product_quality    product_quality_score_sd      ✅
12  product_quality       pigment_quantity_mean      ✅
13  product_quality         pigment_quantity_sd      ✅
14  product_quality                   corr_coef      ✅
