<a href="https://colab.research.google.com/github/RonBartov/Data_Processing/blob/main/factory_data_processing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Factory Data Processing**

# **General Background and Goals**
We have a table called 'factory_test.csv' that contains test data for 51 devices from the population of devices tested in the factory. We are defining the test data as the “**sample group**”.

In addition to the sample group, we have a table called 'new_devices.csv' that contains data for 3 additional devices. We define this data as the “**new sample group**”.

### **Main Goal**
Understand the "sample group" data, and then explore how the "new sample group" relates to the "sample group" in terms of probability.

### **Side Goal**
Suggest and implement a method for testing new devices based on our knowledge
of the "sample group", and specify which devices (from the sample group and the new devices) fail the test and why.

# **Data Description**
Both the "sample group" and the "new sample group" datasets contain the following information:

${\circ}$ 5 **dependent** variables (features) Y1 - Y5.
<br>
${\circ}$ 4 **independent** variables X2 - X5.
<br>
${\circ}$ for i = 2,3,4,5 the feature ${Y_i}$ corresponds to ${X_i}$.
<br>
${\circ}$ The feature ${Y_1}$ has no corresponding variable.

### **Allowed Libraries**
1) Numpy
<br>
2) Pandas
<br>
3) Matplotlib

# **Part by Part**
We will divide this assignment into 5 sections, each focusing on a different task, as follows:

${\circ}$ **Section 1-** Data Loading and Description
<br>
${\circ}$ **Section 2-** Data Visualization and Explanation
<br>
${\circ}$ **Section 3-** Data Preprocessing
<br>
${\circ}$ **Section 4-** Data Exploration
<br>
${\circ}$ **Section 5-** Suggest new testing method
<br>

# **Initial Assumption**
According to the description of the data we can assume that this assignment can be categorized as a "Linear Regression" problem.

Linear regression analysis is used to predict the value of a variable based on the value of another variable. The variable we want to predict is called the dependent variable (Y in our case) and the variable we are using in order to predict Y's value is called the independent variable, which is X in our case.
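
As a rough illustration of the idea, the sketch below fits a straight line with NumPy's least-squares `np.polyfit`. It assumes the four nominal X values (18, 14, 12, 9) and the Y2 to Y5 column means that this notebook prints in Section 1. Pairing them this way is only an illustrative assumption, not the analysis method used in the assignment, and the query value `10.0` is hypothetical.

```python
import numpy as np

# Nominal X values and the column means of Y2..Y5 as printed in Section 1
x = np.array([18.0, 14.0, 12.0, 9.0])
y_mean = np.array([17.425957, 13.452340, 11.673830, 8.346383])

# Least-squares fit of y = slope * x + intercept (degree-1 polynomial)
slope, intercept = np.polyfit(x, y_mean, deg=1)

# Predict Y for a hypothetical new X value (10.0 is illustrative only)
y_pred = slope * 10.0 + intercept
print(f"slope={slope:.3f}, intercept={intercept:.3f}, prediction at X=10: {y_pred:.3f}")
```

A slope close to 1 would suggest that each measurement Y tracks its nominal X value almost one-to-one.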


### **Using 'Notes'**
In order to give clear explanations of specific actions or decisions we make, we will add a text box with a 'Note' title every time we would like to explain something.

# **Functions that will be used throughout the assignment**

# **Import Libraries**

In [78]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from google.colab import drive

# **Section 1- Data Loading and Description**
In this section we will load the sample group data from 'factory_test.csv' and describe its contents.

We will present the following properties for the data:

${\circ}$ Data shape
<br>
${\circ}$ Range of values in each feature and variable
<br>
${\circ}$ Data distribution: Mean and variance.
<br>
${\circ}$ Check for missing values and, if any exist, replace them with a reasonable value
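
The properties listed above can each be obtained with a single pandas call. The following is a minimal sketch on a hypothetical three-row table (`demo`) whose values are copied from the preview output further down. It is not the full data set, and the variable names are illustrative.

```python
import pandas as pd

# Hypothetical miniature table: three rows copied from the notebook's preview output
demo = pd.DataFrame({
    "Y1": [1491, 2004, 1497],
    "Y2": [16.8, 17.94, 18.0],
    "Y3": [13.33, 13.99, 13.59],
})

n_rows, n_cols = demo.shape              # data shape
value_range = demo.agg(["min", "max"])   # range of values per column
mean = demo.mean()                       # per-column mean
variance = demo.var()                    # per-column sample variance (ddof=1)
missing = demo.isnull().sum()            # missing values per column

print(f"shape: {n_rows} x {n_cols}")
print(value_range)
print(pd.DataFrame({"mean": mean, "variance": variance, "missing": missing}))
```

Note that `describe()` reports the standard deviation rather than the variance. The variance is the square of that value, or can be read directly from `var()`.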

In [79]:
# mount google drive into colab
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [80]:
# Loading the sample group data
csv_sample_group_path = r'/content/gdrive/MyDrive/Pulsenmore/factory_test_data.csv'
sample_group = pd.read_csv(csv_sample_group_path)

In [81]:
# Preview the first rows of the data frame to get an initial understanding of the data
num_rows_to_present = 5
sample_group.head(num_rows_to_present)

Unnamed: 0,Y1,X2,Y2,X3,Y3,X4,Y4,X5,Y5
0,1491,18,16.8,14,13.33,12,11.12,9,8.33
1,1491,[],[],[],[],[],[],[],[]
2,2004,18,17.94,14,13.99,12,11.72,9,8.65
3,1493,[],[],[],[],[],[],[],[]
4,1497,18,18,14,13.59,12,12.91,9,7.82


In [82]:
# Sample group data frame general information
sample_group.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51 entries, 0 to 50
Data columns (total 9 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Y1      51 non-null     int64 
 1   X2      51 non-null     object
 2   Y2      51 non-null     object
 3   X3      51 non-null     object
 4   Y3      51 non-null     object
 5   X4      51 non-null     object
 6   Y4      51 non-null     object
 7   X5      51 non-null     object
 8   Y5      51 non-null     object
dtypes: int64(1), object(8)
memory usage: 3.7+ KB


In [84]:
# Convert all non numeric values to NaN
sample_group = sample_group.apply(pd.to_numeric, errors='coerce')
sample_group.head(num_rows_to_present)

Unnamed: 0,Y1,X2,Y2,X3,Y3,X4,Y4,X5,Y5
0,1491,18.0,16.8,14.0,13.33,12.0,11.12,9.0,8.33
1,1491,,,,,,,,
2,2004,18.0,17.94,14.0,13.99,12.0,11.72,9.0,8.65
3,1493,,,,,,,,
4,1497,18.0,18.0,14.0,13.59,12.0,12.91,9.0,7.82
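
To make the conversion step concrete, here is a small sketch of how `pd.to_numeric` with `errors='coerce'` behaves on a column that mixes numeric strings with the `[]` placeholder seen in the raw table. The series `raw` is a hypothetical stand-in for one of the `object` columns.

```python
import pandas as pd

# Hypothetical column mixing numeric strings with the '[]' placeholder seen in the raw data
raw = pd.Series(["18", "[]", "17.94", "[]"])

# errors='coerce' turns anything that cannot be parsed as a number into NaN
numeric = pd.to_numeric(raw, errors="coerce")

print(numeric)
print("unparseable entries:", int(numeric.isnull().sum()))
```

Without `errors='coerce'` the call would raise on the first unparseable entry, so coercion is what lets the missing measurements be counted and filled afterwards.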


In [85]:
# Count the NaN occurrences in each column
count_non_numeric_values = sample_group.isnull().sum()
print(f"Number of non numeric values in each column: \n{count_non_numeric_values}")

Number of non numeric values in each column: 
Y1    0
X2    4
Y2    4
X3    4
Y3    4
X4    4
Y4    4
X5    4
Y5    4
dtype: int64


In [86]:
# Replace each NaN with the mean of its column, which leaves the column mean unchanged
means = sample_group.mean()
print(means)
for column in sample_group.columns:
    is_nan = pd.to_numeric(sample_group[column]).isnull()
    sample_group.loc[is_nan, column] = means[column]

sample_group.head(num_rows_to_present)

Y1    1504.960784
X2      18.000000
Y2      17.425957
X3      14.000000
Y3      13.452340
X4      12.000000
Y4      11.673830
X5       9.000000
Y5       8.346383
dtype: float64


Unnamed: 0,Y1,X2,Y2,X3,Y3,X4,Y4,X5,Y5
0,1491.0,18.0,16.8,14.0,13.33,12.0,11.12,9.0,8.33
1,1491.0,18.0,17.425957,14.0,13.45234,12.0,11.67383,9.0,8.346383
2,2004.0,18.0,17.94,14.0,13.99,12.0,11.72,9.0,8.65
3,1493.0,18.0,17.425957,14.0,13.45234,12.0,11.67383,9.0,8.346383
4,1497.0,18.0,18.0,14.0,13.59,12.0,12.91,9.0,7.82
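
The effect of mean imputation can be checked on a small example. The sketch below uses a hypothetical five-entry column (`col`) with three Y2 values from the preview and two gaps. It shows that filling with the mean keeps the mean unchanged but reduces the spread, which is worth keeping in mind when reading the standard deviations later on.

```python
import numpy as np
import pandas as pd

# Hypothetical column with two missing entries (values taken from the Y2 preview)
col = pd.Series([16.8, np.nan, 17.94, np.nan, 18.0])

filled = col.fillna(col.mean())   # mean() skips NaN by default

print("mean before:", col.mean(), "after:", filled.mean())
print("std  before:", col.std(), "after:", filled.std())
```

The standard deviation shrinks because the imputed entries sit exactly at the mean and add no squared deviation, while the number of samples grows.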


In [87]:
# Present the summary statistics for each column
summary_stats = sample_group.describe()
print(summary_stats)

                Y1    X2         Y2    X3        Y3    X4         Y4    X5  \
count    51.000000  51.0  51.000000  51.0  51.00000  51.0  51.000000  51.0   
mean   1504.960784  18.0  17.425957  14.0  13.45234  12.0  11.673830   9.0   
std      71.384861   0.0   4.101133   0.0   0.11868   0.0   1.176656   0.0   
min    1487.000000  18.0   7.870000  14.0  13.19000  12.0   9.100000   9.0   
25%    1493.000000  18.0  15.455000  14.0  13.40000  12.0  11.070000   9.0   
50%    1495.000000  18.0  17.425957  14.0  13.45234  12.0  11.540000   9.0   
75%    1497.000000  18.0  20.180000  14.0  13.51500  12.0  12.350000   9.0   
max    2004.000000  18.0  25.260000  14.0  13.99000  12.0  14.560000   9.0   

              Y5  
count  51.000000  
mean    8.346383  
std     0.283728  
min     7.810000  
25%     8.210000  
50%     8.340000  
75%     8.505000  
max     9.160000  


In [88]:
# Recheck elements type
sample_group_cols_type = sample_group.dtypes
print(f"Data type of each column: \n{sample_group_cols_type}")

Data type of each column: 
Y1    float64
X2    float64
Y2    float64
X3    float64
Y3    float64
X4    float64
Y4    float64
X5    float64
Y5    float64
dtype: object


## **Note #1**
${\circ}$ The data shape is 51x9. Each row represents one tested device, and the columns hold the values of the dependent features ${Y_i \space (i=1,2,3,4,5)}$ and their corresponding independent variables ${X_i \space (i=2,3,4,5)}$.

${\circ}$ Initially the data set contained some non-numeric values (4 in every column except ${Y_1}$). To fill them with reasonable values, we replaced each one with the mean of its column, calculated from the remaining numeric entries.

${\circ}$ We can describe the data in the following way:
- ${X_i}$ represents a specific attribute of the device. Each attribute is independent of the other attributes.
- ${Y_i}$ is the value obtained from a specific measurement during the device test, taken while the device has the attribute ${X_i}$.
- Because ${X_i}$ is constant for each ${i=2,3,4,5}$, all 51 devices have exactly the same value for each attribute, i.e. the measurements ${Y_i}$ were taken from 51 different devices that share the same ${X_i}$. For the 4 devices whose entries were missing, this holds by construction, since the gaps were filled with the column mean.
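
The claim that each ${X_i}$ is constant can be verified directly instead of being read off the summary table. A minimal sketch, assuming a hypothetical miniature table (`demo`) with the same column layout as the cleaned sample group:

```python
import pandas as pd

# Hypothetical miniature table with the same layout as the cleaned sample group
demo = pd.DataFrame({
    "X2": [18.0, 18.0, 18.0],
    "Y2": [16.8, 17.94, 18.0],
    "X3": [14.0, 14.0, 14.0],
    "Y3": [13.33, 13.99, 13.59],
})

# A column is constant when it holds exactly one distinct value
is_constant = demo.nunique() == 1
constant_cols = is_constant[is_constant].index.tolist()
print("constant columns:", constant_cols)
```

On the real data the same check should flag the X columns, matching the zero standard deviation reported for them by `describe()`.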