In [1]:
import os
import sys
sys.path.append('C:\\Users\\Owner\\OneDrive\\Desktop\\MSc. Bradford\\MSc. Dissertation\\llm_experiment\\src')

import gpt

MODEL = 'gpt-3.5-turbo-0613'
SECRET_KEY = os.getenv("OPENAI_SECRET_KEY")
TEMPERATURE = 0.5
gpt_3_5 = gpt.GPT(MODEL, TEMPERATURE, SECRET_KEY)

In [2]:
prompt = 'What is data cleaning?'
gpt_3_5.query_gpt(prompt)

Data cleaning, also known as data cleansing or data scrubbing, is the process of identifying and correcting or removing errors, inconsistencies, and inaccuracies in a dataset. It involves various techniques and methods to ensure that the data is accurate, complete, and reliable for analysis.

Data cleaning typically includes tasks such as:

1. Removing duplicate records: Identifying and eliminating identical or near-identical observations in the dataset.

2. Handling missing values: Dealing with missing data by either imputing values based on other observations or removing the records with missing values.

3. Correcting inaccurate data: Identifying and rectifying errors, outliers, or inconsistencies in the data that may have occurred during data collection or entry.

4. Standardizing variables: Ensuring consistency in the format and units of measurement across different variables.

5. Handling outliers: Identifying and addressing extreme values that may significantly affect the analysis or modeling results.

6. Resolving inconsistencies: Checking for inconsistencies in the data, such as conflicting information or contradictory values, and resolving them.

7. Removing irrelevant or redundant variables: Eliminating variables that are not necessary for the analysis or do not contribute meaningful information.

Data cleaning is an essential step in the data preparation process as it helps improve the quality and reliability of the data, which in turn enhances the accuracy and validity of any subsequent analysis or modeling tasks.

In [3]:
prompt = 'Write python codes to perform the tasks you listed earlier.'
gpt_3_5.query_gpt(prompt)

Certainly! Here are some examples of Python code to perform common data cleaning tasks:

1. Removing duplicate records:
```python
# Assuming 'data' is a pandas DataFrame
data = data.drop_duplicates()
```

2. Handling missing values:
```python
# Assuming 'data' is a pandas DataFrame
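# Option A: drop rows with any missing values
data = data.dropna()

# Option B: impute missing values with the mean of the column
# (assign the result back; calling fillna with inplace=True on a selected column may not update 'data')
data['column_name'] = data['column_name'].fillna(data['column_name'].mean())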
```

3. Correcting inaccurate data:
```python
# Assuming 'data' is a pandas DataFrame
# Replace incorrect values with correct ones
data['column_name'] = data['column_name'].replace({'incorrect_value': 'correct_value'})
```

4. Standardizing variables:
```python
# Assuming 'data' is a pandas DataFrame
# Convert a variable to a specific data type
data['column_name'] = data['column_name'].astype('int')

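# Convert units of measurement
# ('conversion_factor' is a placeholder; define it first, e.g. 2.54 to convert inches to centimetres)
data['column_name'] = data['column_name'] * conversion_factor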
```

5. Handling outliers:
```python
# Assuming 'data' is a pandas DataFrame
# Remove outliers using z-score method
import numpy as np
from scipy import stats
data = data[(np.abs(stats.zscore(data['column_name'])) < 3)]
```

6. Resolving inconsistencies:
```python
# Assuming 'data' is a pandas DataFrame
# Identify and fix inconsistent values
data['column_name'] = data['column_name'].replace({'inconsistent_value': 'consistent_value'})
```

7. Removing irrelevant or redundant variables:
```python
# Assuming 'data' is a pandas DataFrame
# Drop irrelevant or redundant columns
data = data.drop(['column_name1', 'column_name2'], axis=1)
```

Please note that these are general examples, and the code may need to be modified based on the specific dataset and requirements. Also, make sure to import the necessary libraries (e.g., pandas, numpy, scipy) before executing the code.
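
As a minimal sketch of how several of the snippets above fit together, the example below runs them on a small made-up DataFrame. The column names (`city`, `height_cm`, `notes`) and all values are hypothetical and chosen only for illustration; a real dataset will need its own choices at each step.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; names and values are illustrative only
data = pd.DataFrame({
    "city": ["Leeds", "Leeds", "bradford", "York", "York"],
    "height_cm": [170.0, 170.0, np.nan, 182.0, 165.0],
    "notes": ["a", "a", "b", "c", "d"],
})

# Task 1: remove exact duplicate rows (the second "Leeds" row)
data = data.drop_duplicates()

# Task 2: impute the missing height with the column mean
data["height_cm"] = data["height_cm"].fillna(data["height_cm"].mean())

# Task 3: correct a known-bad value
data["city"] = data["city"].replace({"bradford": "Bradford"})

# Task 4: standardize units (centimetres to metres)
data["height_m"] = data["height_cm"] / 100

# Task 7: drop columns that are no longer needed
data = data.drop(columns=["notes", "height_cm"]).reset_index(drop=True)

print(data)
```

Each step assigns its result back to `data` rather than relying on `inplace=True`, which keeps the order of operations explicit and easy to rearrange.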

In [4]:
prompt = 'How do I know which task applies to my dataset?'
gpt_3_5.query_gpt(prompt)

To determine which data cleaning tasks apply to your dataset, you need to understand the characteristics and quality of your data. Here are some steps to help you identify the specific tasks required:

1. Data profiling: Perform an initial exploration of your dataset to gain insights into its structure, variables, and quality. This can involve examining summary statistics, checking for missing values, identifying unique values, and visualizing distributions or patterns.

2. Data quality assessment: Evaluate the quality of your data by assessing its completeness, accuracy, consistency, and relevance. This can be done by checking for missing values, outliers, inconsistencies, and redundant or irrelevant variables.

3. Domain knowledge: Consider your knowledge of the domain to identify potential data issues or errors that may commonly occur. For example, if you are working with customer data, you might expect missing values in certain fields like phone numbers or addresses.

4. Data exploration: Conduct exploratory data analysis (EDA) to further understand your data and identify any anomalies or patterns that may require cleaning. Visualization techniques such as scatter plots, histograms, or box plots can help in identifying outliers or unusual data points.

5. Data cleaning best practices: Familiarize yourself with common data cleaning tasks and techniques. This includes handling missing values, removing duplicates, correcting inaccurate data, standardizing variables, resolving inconsistencies, and removing irrelevant or redundant variables. Consider the applicability of these tasks to your dataset based on the issues identified in steps 1-4.

By following these steps, you can gain a better understanding of your dataset's characteristics and identify the data cleaning tasks that are necessary to ensure its quality and reliability for analysis.
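
The profiling and quality-assessment steps above can be sketched with a few pandas calls. This is only an illustration on a made-up DataFrame; the column names `age` and `country` are assumptions, not part of any real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; names and values are illustrative only
data = pd.DataFrame({
    "age": [25, 31, np.nan, 47, 31],
    "country": ["UK", "uk", "UK", None, "uk"],
})

# Summary statistics for numeric and non-numeric columns
summary = data.describe(include="all")

# Missing values per column: suggests whether imputation or dropping is needed
missing_per_column = data.isna().sum()

# Number of fully duplicated rows: suggests whether drop_duplicates() is needed
duplicate_rows = int(data.duplicated().sum())

# Distinct values in a categorical column: reveals inconsistent spellings
unique_countries = sorted(data["country"].dropna().unique())

print(summary)
print(missing_per_column)
print("Duplicate rows:", duplicate_rows)
print("Distinct countries:", unique_countries)
```

Here the distinct values `'UK'` and `'uk'` would point to the "resolving inconsistencies" task, while the non-zero missing counts point to the "handling missing values" task.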

In [5]:
prompt = 'Can you explain the code you shared under point 5: Handling Outliers?'
gpt_3_5.query_gpt(prompt)

Certainly! The code provided for handling outliers using the z-score method is as follows:

```python
# Assuming 'data' is a pandas DataFrame
# Remove outliers using z-score method
import numpy as np
from scipy import stats
data = data[(np.abs(stats.zscore(data['column_name'])) < 3)]
```

Here's a breakdown of the code:

1. The code assumes that you have a pandas DataFrame named 'data' and you want to remove outliers from a specific column named 'column_name'. You can replace 'column_name' with the actual name of the column you want to clean.

2. The code imports the 'stats' module from the 'scipy' library. The 'stats' module provides various statistical functions and methods, including the z-score calculation.

3. The code applies the z-score calculation to the 'column_name' column using the 'stats.zscore()' function. The z-score measures how many standard deviations an observation is away from the mean. It helps identify observations that are significantly different from the rest of the data.

4. The code uses the 'np.abs()' function from the numpy library (conventionally made available with 'import numpy as np') to calculate the absolute value of the z-scores. This ensures that we consider the magnitude of the deviation from the mean rather than the direction.

5. The code applies a condition to filter the DataFrame 'data' using boolean indexing. It keeps only the rows where the absolute z-score is less than 3. This threshold of 3 is a commonly used value to identify outliers. You can adjust this threshold based on your specific requirements.

By applying this code, rows with outliers in the specified column will be removed from the 'data' DataFrame, leaving behind a cleaned dataset without those extreme values.
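
To make the breakdown concrete, here is a small worked sketch on made-up numbers. It computes the z-scores by hand with pandas, using the population standard deviation (`ddof=0`), which is what `stats.zscore` uses by default. The column name `value` and the data are hypothetical.

```python
import pandas as pd

# Hypothetical data: 19 ordinary readings and one extreme value (95)
data = pd.DataFrame({"value": [10, 12, 11, 13, 12, 11, 10, 12, 13, 11,
                               12, 10, 11, 13, 12, 11, 10, 12, 11, 95]})

# z-score = (observation - mean) / standard deviation
mean = data["value"].mean()
std = data["value"].std(ddof=0)  # population std, matching the stats.zscore default
z_scores = (data["value"] - mean) / std

# Keep only rows whose absolute z-score is below the threshold of 3
cleaned = data[z_scores.abs() < 3]

print(z_scores.round(2).tolist())
print(len(data), "rows before,", len(cleaned), "rows after")
```

Two caveats are worth keeping in mind. In a very small sample no single point can reach a z-score of 3, so the filter removes nothing. And if the column contains missing values, `stats.zscore` with its default settings returns NaN for every row, so the filter would drop all rows; handle missing values first.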