In [1]:
import os
import sys
sys.path.append('C:\\Users\\Owner\\OneDrive\\Desktop\\MSc. Bradford\\MSc. Dissertation\\llm_experiment\\src')

import pandas as pd

import gpt

MODEL = 'gpt-3.5-turbo-0613'
SECRET_KEY = os.getenv("OPENAI_SECRET_KEY")
TEMPERATURE = 0.5
gpt_3_5 = gpt.GPT(MODEL, TEMPERATURE, SECRET_KEY)

#### Read in Prompt Templates

In [2]:
with open('../../../prompt_templates/data_understanding.txt') as file:
    data_understanding_pt = file.read()

with open('../../../prompt_templates/data_preparation.txt') as file:
    data_preparation_pt = file.read()

with open('../../../prompt_templates/modelling.txt') as file:
    modelling_pt = file.read()

### Data Understanding

In [3]:
X = pd.read_csv('../../../data/traininginputs.csv')
y = pd.read_csv('../../../data/trainingoutput.csv')
pred_variables = X.columns[1:].tolist()
target_variable = y.columns[1]
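
The prompt in the next cell is built with positional `str.format` placeholders. The following is a minimal sketch of that mechanism. The names `template`, `example_predictors` and `example_target`, and the wording of the template itself, are hypothetical, since the contents of `data_understanding.txt` are not shown in this notebook.

```python
# Hypothetical template: the real data_understanding.txt is not shown here.
template = (
    "Predictor variables: {}\n"
    "Target variable: {}\n"
    "Additional notes: {}"
)

# Illustrative column names, not the real dataset columns.
example_predictors = ["feature_a", "feature_b"]
example_target = "label"

# Each {} is filled in order; lists and None are rendered with str().
example_prompt = template.format(example_predictors, example_target, None)
print(example_prompt)
```

Because a Python list is inserted in its `str()` form, the model sees the brackets and quotes. Joining the names with `", ".join(...)` before formatting is one possible alternative.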

In [4]:
prompt = data_understanding_pt.format(pred_variables, target_variable, None)
gpt_3_5.query_gpt(prompt)

To start the data understanding phase, we will first import the necessary libraries: pandas, numpy, and matplotlib. Pandas will be used for data manipulation and analysis, numpy for numerical operations, and matplotlib for data visualization.

Next, we will load the data into a pandas DataFrame and inspect the first few rows to get an initial understanding of the data structure.

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Load the data into a pandas DataFrame
data = pd.read_csv('data.csv')

# Inspect the first few rows of the data
print(data.head())
```

After loading the data, we can generate some statistical summaries to gain insights into the data distribution and identify any potential issues.

```python
# Generate statistical summaries
print(data.describe())
```

This will provide summary statistics such as count, mean, standard deviation, minimum, and maximum values for each numerical variable. It will help us understand the range and distribution of the data.

Next, we can visualize the data using histograms, box plots, and scatter plots to further explore the relationships between variables and identify any outliers or patterns.

```python
# Visualize the data using histograms
data.hist(figsize=(10, 8))
plt.tight_layout()
plt.show()

# Visualize the data using box plots
data.boxplot(figsize=(10, 8))
plt.tight_layout()
plt.show()

# Visualize the relationships between variables using scatter plots
pd.plotting.scatter_matrix(data, figsize=(12, 12))
plt.tight_layout()
plt.show()
```

The histograms will give us an overview of the distribution of each variable, while the box plots will help identify any outliers. The scatter plots will show the relationships between pairs of variables, which can be useful for identifying correlations.
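
The box-plot check for outliers can also be expressed numerically. This is a minimal sketch, assuming the common 1.5 × IQR whisker rule (matplotlib's default) and a small made-up DataFrame named `toy` whose columns are purely illustrative.

```python
import pandas as pd

# Toy data for illustration only; column names are hypothetical.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 100.0],
    "b": [10.0, 11.0, 12.0, 13.0, 14.0],
})

# Quartiles and interquartile range per column.
q1 = toy.quantile(0.25)
q3 = toy.quantile(0.75)
iqr = q3 - q1

# A value is flagged when it lies more than 1.5 * IQR outside the quartiles.
is_outlier = (toy < q1 - 1.5 * iqr) | (toy > q3 + 1.5 * iqr)
outlier_counts = is_outlier.sum()
print(outlier_counts)
```

Counting flagged values per column gives a quick way to rank features by how many points fall outside the whiskers, which is harder to read off a plot with many variables.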

For the data preparation phase, some important things to note are:

1. Missing Values: Check for missing values in the dataset and decide on an appropriate strategy for handling them. This could involve imputing missing values or removing rows/columns with missing values.

2. Categorical Variables: Identify if there are any categorical variables in the dataset and consider encoding them appropriately for machine learning algorithms.

3. Feature Scaling: Depending on the machine learning algorithms to be used, it may be necessary to scale the predictor variables to a similar range to avoid biasing the model towards variables with larger magnitudes.

4. Feature Selection: Evaluate the relevance and importance of predictor variables in relation to the target variable. Consider performing feature selection techniques to select the most informative variables for the model.

5. Data Split: Split the dataset into training and testing sets to evaluate the model's performance on unseen data. This can be done using techniques such as cross-validation or a simple train-test split.

By following these steps, we can gain a better understanding of the data and prepare it appropriately for building a machine learning model.
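
Some of the checks listed above (missing values, non-numeric columns, and the class balance of the target) could be sketched as follows. The DataFrame `toy` and its column names are made up for illustration and do not come from the actual dataset.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; names and values are hypothetical.
toy = pd.DataFrame({
    "measure": [1.2, np.nan, 0.9, 1.1],
    "machine": ["m1", "m2", "m1", "m1"],
    "defect": [0, 0, 0, 1],
})

# Missing values per column.
missing_counts = toy.isna().sum()

# Non-numeric columns, which would need encoding before modelling.
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

# Class balance of the target, as proportions.
class_share = toy["defect"].value_counts(normalize=True)

print(missing_counts, categorical_cols, class_share, sep="\n")
```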

### Data Preparation

In [5]:
quality_report = ['There are outliers present as observed from the boxplot',
                    'The OP100_Capuchon_insertion_mesure feature has some missing values']
exploration_report = ['Some features have a skewed distribution',
                        'The target variable has an imbalanced class distribution',
                        'There is little or no correlation between the predictor and target variables',
                        'Some predictor variables are correlated']

In [6]:
prompt = data_preparation_pt.format(pred_variables, target_variable, quality_report, exploration_report, None)
gpt_3_5.query_gpt(prompt)

To prepare the data for modeling, we need to address the issues mentioned in the data quality report and data exploration report. Here's how we can handle each issue:

1. Outliers: Since outliers are present in the data, we can consider using a robust scaler, such as the RobustScaler from the scikit-learn library, to scale the predictor variables. This scaler is less sensitive to outliers and can help mitigate their impact on the model.

2. Missing Values: The feature 'OP100_Capuchon_insertion_mesure' has missing values. We can handle this by either imputing the missing values or removing the rows/columns with missing values. To impute missing values, we can use mean imputation (for example scikit-learn's `SimpleImputer` or pandas' `fillna`) or regression-based imputation (for example scikit-learn's `IterativeImputer`).

3. Skewed Distribution: As some features have a skewed distribution, we can apply a transformation to make the distribution more symmetric. One common transformation is the logarithmic transformation, which can be applied using the numpy library's log function.

4. Imbalanced Class Distribution: Since the target variable has an imbalanced class distribution, we need to address this issue to prevent the model from being biased towards the majority class. Techniques such as oversampling the minority class (e.g., using the SMOTE algorithm from the imbalanced-learn library) or undersampling the majority class can be applied to balance the classes.

5. Correlation: Although there is little or no correlation between the predictor and target variables, some predictor variables are correlated. This can cause multicollinearity issues in the model. It is important to address this by either removing one of the correlated variables or using techniques like principal component analysis (PCA) to reduce the dimensionality of the dataset.
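
For item 5, one simple way to remove one variable from each highly correlated pair is a threshold on the absolute correlation matrix. This is a minimal sketch on synthetic data; the column names and the 0.9 cut-off are assumptions chosen for illustration, not values taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy predictors for illustration; column names are hypothetical.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
toy = pd.DataFrame({
    "x1": x1,
    "x2": 2.0 * x1 + rng.normal(scale=0.01, size=200),  # near-duplicate of x1
    "x3": rng.normal(size=200),
})

THRESHOLD = 0.9  # assumed cut-off, to be tuned for the real data

# Absolute pairwise correlations; keep only the upper triangle so that each
# pair is considered once and the diagonal is ignored.
corr = toy.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop the later column of every pair whose correlation exceeds the threshold.
to_drop = [col for col in upper.columns if (upper[col] > THRESHOLD).any()]
reduced = toy.drop(columns=to_drop)
print(to_drop)
```

This keeps the original variables interpretable, whereas PCA replaces them with linear combinations.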

Here's an example of how the data preparation phase could be implemented:

```python
import pandas as pd
import numpy as np
from sklearn.preprocessing import RobustScaler
from sklearn.impute import SimpleImputer
from imblearn.over_sampling import SMOTE

# Load the data into a pandas DataFrame
data = pd.read_csv('data.csv')

# Handling missing values
imputer = SimpleImputer(strategy='mean')
data['OP100_Capuchon_insertion_mesure'] = imputer.fit_transform(data['OP100_Capuchon_insertion_mesure'].values.reshape(-1, 1))

# Handling outliers
scaler = RobustScaler()
data[predictor_variables] = scaler.fit_transform(data[predictor_variables])

# Handling skewed distribution
data[predictor_variables] = np.log1p(data[predictor_variables])

# Handling imbalanced class distribution
X = data[predictor_variables]
y = data['OP130_Resultat_Global_v']
smote = SMOTE()
X_resampled, y_resampled = smote.fit_resample(X, y)

# Summary of important things to note for the modeling phase:
# - Choose appropriate evaluation metrics for imbalanced classification, such as precision, recall, F1-score, or area under the ROC curve.
# - Consider using ensemble methods, such as random forests or gradient boosting, which can handle imbalanced datasets better.
# - Perform feature selection to select the most informative variables for the model.
# - Split the data into training and testing sets for model evaluation.
# - Apply appropriate cross-validation techniques to assess the model's performance on unseen data.
```
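
One plausible way for the block above to introduce missing values is the order of its steps: `RobustScaler` centres each feature on its median, so scaled values can fall below -1, where `np.log1p` returns NaN. This would be consistent with the SMOTE failure recorded in the modelling notes later in this notebook. The sketch below illustrates the effect on made-up values and shows the alternative order, assuming the raw feature is non-negative (not verified for the real data).

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Toy non-negative feature (illustrative values only).
values = np.array([[0.0], [5.0], [6.0], [7.0], [50.0]])

# Order used above: scale first, then log1p. The smallest value is scaled
# to -3, which is outside the domain of log1p and becomes NaN.
scaled = RobustScaler().fit_transform(values)
with np.errstate(invalid="ignore", divide="ignore"):
    scaled_then_logged = np.log1p(scaled)

# Alternative order: log1p on the raw non-negative values, then scale.
logged_then_scaled = RobustScaler().fit_transform(np.log1p(values))

print(np.isnan(scaled_then_logged).sum(), np.isnan(logged_then_scaled).sum())
```

If a feature can take values at or below -1 before scaling, `log1p` is not applicable in either order; scikit-learn's `PowerTransformer` with `method='yeo-johnson'` is one transform that accepts negative inputs.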

In the modeling phase, it is important to choose appropriate evaluation metrics for imbalanced classification and consider using ensemble methods that can handle imbalanced datasets better. Feature selection, data splitting, and cross-validation techniques should also be applied.

### Modelling

In [7]:
eval_metrics = ['f1-score', 'roc_auc']
notes = ['The target variable has an imbalanced class distribution', 
        'The data has not yet been split into train and test sets',
        'The code for balancing the data with SMOTE failed because of the presence of missing values after scaling was performed.']

In [8]:
prompt = modelling_pt.format(pred_variables, target_variable, eval_metrics, notes)
gpt_3_5.query_gpt(prompt)

To perform the modeling phase, we will first import the necessary libraries: pandas, numpy, and scikit-learn. These libraries provide various machine learning algorithms, evaluation metrics, and tools for model training and evaluation.

Next, we will load the prepared dataset and split it into training and testing sets. It is important to split the data to evaluate the model's performance on unseen data. We will use the train_test_split function from scikit-learn for this purpose.

```python
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# Load the prepared dataset
data = pd.read_csv('prepared_data.csv')

# Split the data into training and testing sets
X = data[predictor_variables]
y = data['OP130_Resultat_Global_v']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```

Since the target variable has an imbalanced class distribution, we need to address this issue to prevent the model from being biased towards the majority class. However, the code for balancing the data with SMOTE failed because of the presence of missing values after scaling was performed. In this case, we can proceed with training the models on the imbalanced dataset and evaluate their performance using appropriate evaluation metrics.
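
When training on the imbalanced data without resampling, two common adjustments are a stratified split and class weighting. This is a minimal sketch on synthetic data, not the approach taken in this notebook; `X_toy`, `y_toy` and the roughly 10% positive rate are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data for illustration: about 10% positives.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(500, 3))
y_toy = (rng.random(500) < 0.1).astype(int)

# stratify=y_toy keeps the class ratio (nearly) the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

# class_weight="balanced" reweights samples inversely to class frequency,
# one alternative to resampling with SMOTE.
clf = LogisticRegression(class_weight="balanced")
clf.fit(X_tr, y_tr)

print(y_tr.mean(), y_te.mean())
```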

We will train four different models based on different algorithms and evaluate their performance using the f1-score and roc_auc metrics. The f1-score is the harmonic mean of precision and recall, while the roc_auc metric measures the model's ability to discriminate between the positive and negative classes.

Here's an example of how the modeling phase could be implemented:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import f1_score, roc_auc_score

# Train and evaluate logistic regression model
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
logreg_pred = logreg.predict(X_test)
logreg_f1 = f1_score(y_test, logreg_pred)
logreg_roc_auc = roc_auc_score(y_test, logreg_pred)

# Train and evaluate decision tree model
dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)
dt_pred = dt.predict(X_test)
dt_f1 = f1_score(y_test, dt_pred)
dt_roc_auc = roc_auc_score(y_test, dt_pred)

# Train and evaluate random forest model
rf = RandomForestClassifier()
rf.fit(X_train, y_train)
rf_pred = rf.predict(X_test)
rf_f1 = f1_score(y_test, rf_pred)
rf_roc_auc = roc_auc_score(y_test, rf_pred)

# Train and evaluate support vector machine model
svm = SVC()
svm.fit(X_train, y_train)
svm_pred = svm.predict(X_test)
svm_f1 = f1_score(y_test, svm_pred)
svm_roc_auc = roc_auc_score(y_test, svm_pred)

# Create a summary of the model performances
model_summary = pd.DataFrame({
    'Model': ['Logistic Regression', 'Decision Tree', 'Random Forest', 'Support Vector Machine'],
    'F1-Score': [logreg_f1, dt_f1, rf_f1, svm_f1],
    'ROC AUC': [logreg_roc_auc, dt_roc_auc, rf_roc_auc, svm_roc_auc]
})

print(model_summary)
```

In this example, we trained and evaluated four different models: logistic regression, decision tree, random forest, and support vector machine. We calculated the f1-score and roc_auc for each model and created a summary table to compare their performances.

The choice of these algorithms was based on their suitability for binary classification tasks and their availability in scikit-learn. Logistic regression is a simple and interpretable algorithm, decision tree and random forest are non-linear algorithms that can capture complex relationships, and support vector machine is a powerful algorithm for separating classes in high-dimensional space.

By evaluating multiple models and comparing their performances using appropriate evaluation metrics, we can determine the best modeling approach for predicting defects on the production line.
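
The example above passes hard class predictions to `roc_auc_score`, which evaluates the ROC curve at a single threshold. The metric is usually computed from continuous scores instead. This is a minimal sketch of the difference on synthetic data; `X_toy` and `y_toy` are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary problem for illustration only.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(400, 2))
y_toy = (X_toy[:, 0] + rng.normal(scale=1.0, size=400) > 1.0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=0
)
model = LogisticRegression().fit(X_tr, y_tr)

# AUC from hard 0/1 labels reflects a single decision threshold.
auc_from_labels = roc_auc_score(y_te, model.predict(X_te))

# AUC from predicted probabilities of the positive class uses the full ranking.
auc_from_scores = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])

print(auc_from_labels, auc_from_scores)
```

For `SVC`, which does not expose `predict_proba` unless `probability=True` is set, the output of `decision_function` can be passed to `roc_auc_score` in the same way.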