# Daily Challenge: Mastering Data Analysis With ChatGPT: Iris Dataset Exploration

## This session explores the use of ChatGPT for data analysis with hands-on exercises on the Iris Species dataset from Kaggle, focusing on model selection, data anonymization, batching, querying, and best practices.

In [10]:
pip install openai==0.28


In [8]:
import pandas as pd
import openai
import hashlib
import os
import warnings
warnings.filterwarnings(action='ignore')  # silence library warnings in the notebook output

In [2]:
# Loading the dataset 
file_path = r'C:\Users\Acer\Desktop\iris.csv'
df = pd.read_csv(file_path) 

df.info()
df.head(10)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             150 non-null    int64  
 1   SepalLengthCm  150 non-null    float64
 2   SepalWidthCm   150 non-null    float64
 3   PetalLengthCm  150 non-null    float64
 4   PetalWidthCm   150 non-null    float64
 5   Species        150 non-null    object 
dtypes: float64(4), int64(1), object(1)
memory usage: 7.2+ KB


| | Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species |
|---|---|---|---|---|---|---|
| 0 | 1 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 2 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 3 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
| 5 | 6 | 5.4 | 3.9 | 1.7 | 0.4 | Iris-setosa |
| 6 | 7 | 4.6 | 3.4 | 1.4 | 0.3 | Iris-setosa |
| 7 | 8 | 5.0 | 3.4 | 1.5 | 0.2 | Iris-setosa |
| 8 | 9 | 4.4 | 2.9 | 1.4 | 0.2 | Iris-setosa |
| 9 | 10 | 4.9 | 3.1 | 1.5 | 0.1 | Iris-setosa |


### Select an appropriate model for a given data analysis scenario and briefly justify your choice

<div style="border:solid green 2px; padding: 20px">

Selected ChatGPT model: GPT-3.5.
Rationale: its natural language processing capabilities and its ability to assist at various stages of the data analysis process.

### Write a Python function to anonymize the ‘Species’ column. Apply your code and check its efficiency. For this section, refer to notebook 2 on anonymization and the use of the hashlib package. Use the sha256 hashing algorithm

In [4]:
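def anonymize_species(dataframe):
    """Return a copy of the DataFrame with 'Species' replaced by SHA-256 hex digests."""
    # Work on a copy so the caller's DataFrame keeps its readable labels
    anonymized = dataframe.copy()
    # Apply SHA-256 hash to each species label
    anonymized['Species'] = anonymized['Species'].apply(lambda x: hashlib.sha256(x.encode()).hexdigest())
    return anonymized

# Applying the anonymization function to the dataset
anonymized_data = anonymize_species(df)
anonymized_data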

| | Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species |
|---|---|---|---|---|---|---|
| 0 | 1 | 5.1 | 3.5 | 1.4 | 0.2 | 232df5210712987992fd1445b9c338f2ad60fa3235795a... |
| 1 | 2 | 4.9 | 3.0 | 1.4 | 0.2 | 232df5210712987992fd1445b9c338f2ad60fa3235795a... |
| 2 | 3 | 4.7 | 3.2 | 1.3 | 0.2 | 232df5210712987992fd1445b9c338f2ad60fa3235795a... |
| 3 | 4 | 4.6 | 3.1 | 1.5 | 0.2 | 232df5210712987992fd1445b9c338f2ad60fa3235795a... |
| 4 | 5 | 5.0 | 3.6 | 1.4 | 0.2 | 232df5210712987992fd1445b9c338f2ad60fa3235795a... |
| ... | ... | ... | ... | ... | ... | ... |
| 145 | 146 | 6.7 | 3.0 | 5.2 | 2.3 | 26b8cda6524a21cc2c637c0bc9b9066a7179230c7f920e... |
| 146 | 147 | 6.3 | 2.5 | 5.0 | 1.9 | 26b8cda6524a21cc2c637c0bc9b9066a7179230c7f920e... |
| 147 | 148 | 6.5 | 3.0 | 5.2 | 2.0 | 26b8cda6524a21cc2c637c0bc9b9066a7179230c7f920e... |
| 148 | 149 | 6.2 | 3.4 | 5.4 | 2.3 | 26b8cda6524a21cc2c637c0bc9b9066a7179230c7f920e... |
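
One way to act on the "check its efficiency" part of the task is to hash each distinct label once and reuse the result, then confirm that the mapping is deterministic and keeps the labels distinct. The sketch below uses only `hashlib`. The helper names `sha256_hex` and `build_label_map`, the optional `salt` parameter, and the salt value are assumptions made for illustration and are not part of the original notebook. With a pandas column, the same mapping could be applied through `Series.map`.

```python
import hashlib


def sha256_hex(label, salt=""):
    """Return the SHA-256 hex digest of a label, optionally prefixed with a salt."""
    return hashlib.sha256((salt + label).encode("utf-8")).hexdigest()


def build_label_map(labels, salt=""):
    """Hash every distinct label exactly once and return {label: digest}."""
    return {label: sha256_hex(label, salt) for label in sorted(set(labels))}


# Sample mirroring the dataset's 150 rows (50 per species).
labels = ["Iris-setosa"] * 50 + ["Iris-versicolor"] * 50 + ["Iris-virginica"] * 50

label_map = build_label_map(labels)
anonymized = [label_map[label] for label in labels]

# Only three hashes are computed, no matter how many rows there are.
print(len(label_map), "distinct digests for", len(anonymized), "rows")

# Without a salt, anyone who can guess the label can recompute its digest,
# so the plain hash hides the name but is easy to reverse by lookup.
guess = sha256_hex("Iris-setosa")
print("guess matches stored digest:", guess == label_map["Iris-setosa"])

# A secret salt (hypothetical value here) changes every digest.
salted_map = build_label_map(labels, salt="example-secret")
print("salted differs from unsalted:", salted_map["Iris-setosa"] != guess)
```

Hashing the three distinct labels once, instead of once per row, is the main saving. The lookup check is a reminder that an unsalted hash of a small, well-known label set offers little protection on its own.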


### Create a batch of queries for statistical analysis (mean, median) of the Iris dataset

In [None]:
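# Set your OpenAI API key (placeholder value; avoid committing a real key)
os.environ["OPENAI_API_KEY"] = "your-api-key"
# openai was already imported above, so set the key on the module as well
openai.api_key = os.environ["OPENAI_API_KEY"]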

In [None]:
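def send_query_to_chatgpt(query, engine="text-davinci-002", max_tokens=100):
    """Send one prompt to the Completions endpoint (openai==0.28) and return the text reply."""
    # Only the prompt text is sent; the model never sees the local DataFrame
    insights = openai.Completion.create(engine=engine, prompt=query, max_tokens=max_tokens)
    generative_insights = insights.choices[0].text.strip()
    return generative_insights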

# Batch of queries for statistical analysis
queries = [
    "Calculate the mean of SepalLengthCm for the Iris dataset.",
    "Compute the median of SepalWidthCm in the Iris dataset.",
    "Find the mean value of PetalLengthCm for different Iris species.",
    "Calculate the median of PetalWidthCm for the Iris dataset.",
]

# Send the queries one after another and collect the responses
batch_responses = [send_query_to_chatgpt(query) for query in queries]

# Display the batch responses
print("Batch Responses:")
for query, response in zip(queries, batch_responses):
    print(f"Query: {query}\nResponse: {response}\n")
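
The queries above contain no data, so the model can only answer from what it already knows about the Iris dataset. A minimal sketch of one alternative is to embed the column values in each prompt and compute a local reference answer to compare against the response. The helper `build_stat_prompt` and the `reference` dictionary are hypothetical names, and the sample values are the first ten `SepalLengthCm` rows shown in the `df.head(10)` output. No API call is made here.

```python
import statistics

# First ten SepalLengthCm values, copied from the df.head(10) output.
sepal_length = [5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9]


def build_stat_prompt(stat_name, column_name, values):
    """Build a prompt that carries the data the model is asked to summarize."""
    joined = ", ".join(str(v) for v in values)
    return (
        f"Calculate the {stat_name} of {column_name} for these values: {joined}. "
        "Reply with the number only."
    )


# Batch of prompts, one per statistic.
prompts = [
    build_stat_prompt(stat, "SepalLengthCm", sepal_length)
    for stat in ("mean", "median")
]

# Local reference answers to check the model's replies against.
reference = {
    "mean": round(statistics.mean(sepal_length), 2),
    "median": statistics.median(sepal_length),
}

for prompt in prompts:
    print(prompt)
print(reference)
```

Each prompt could then be passed to `send_query_to_chatgpt` and the reply compared with the matching entry in `reference`.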

### Formulate a query to analyze petal length variations in the Iris dataset.

Explore the variations in petal length (PetalLengthCm) across different species in the Iris dataset. Calculate summary statistics, such as mean, median, and standard deviation, to understand the distribution. Additionally, visualize the distribution using a histogram or boxplot to identify any patterns or outliers.
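
The summary statistics named in this query can also be computed locally as a cross-check on whatever the model returns. The sketch below groups petal lengths by label and reports the mean, median, and sample standard deviation using the standard library. `summarize_by_group` is a hypothetical helper, and the sample rows are the petal lengths visible in the anonymized output (rows 0 to 4 and 145 to 148), keyed by the first eight characters of their hashed label. With the full DataFrame, `df.groupby('Species')['PetalLengthCm'].agg(['mean', 'median', 'std'])` would give the same three statistics per species.

```python
import statistics
from collections import defaultdict

# (label prefix, PetalLengthCm) pairs copied from the anonymized output above.
rows = [
    ("232df521", 1.4), ("232df521", 1.4), ("232df521", 1.3),
    ("232df521", 1.5), ("232df521", 1.4),
    ("26b8cda6", 5.2), ("26b8cda6", 5.0), ("26b8cda6", 5.2), ("26b8cda6", 5.4),
]


def summarize_by_group(pairs):
    """Return {label: {"mean", "median", "std"}} for (label, value) pairs."""
    groups = defaultdict(list)
    for label, value in pairs:
        groups[label].append(value)
    return {
        label: {
            "mean": statistics.mean(values),
            "median": statistics.median(values),
            "std": statistics.stdev(values),  # sample standard deviation
        }
        for label, values in groups.items()
    }


summary = summarize_by_group(rows)
for label, stats in summary.items():
    print(label, {name: round(value, 3) for name, value in stats.items()})
```

Grouping works on the hashed labels just as it would on the readable ones, since identical species names always produce identical digests.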

## Next Day Presentation:
Create a short presentation (<10 min) that should include the following sections:

Introduction: Introduce the challenge, present the chosen ChatGPT model, and explain why you chose it.
Body: Detail the anonymization and batching processes, as well as your queries and the generated results.
Conclusion: Wrap up by summarizing key learnings. Emphasize the importance of understanding ChatGPT’s capabilities and limitations in data analysis.

<div style="border:solid green 2px; padding: 20px">

**Title: Unleashing ChatGPT in Data Analysis: A Journey through Anonymization, Batching, and Insightful Queries**

**Slide 1: Introduction**

Title: Navigating Data Challenges with ChatGPT

Briefly introduce the challenge: Leveraging ChatGPT for data analysis on the Iris dataset.
Overview of the selected ChatGPT model: GPT-3.5.
Rationale behind choosing ChatGPT for the task: Its natural language processing capabilities and ability to assist in various stages of the data analysis process.

**Slide 2: Anonymization Process**
Title: Safeguarding Privacy with Anonymization

Overview of anonymization: Protecting sensitive information in the Iris dataset.
Explanation of the hashlib library and SHA-256 algorithm.
Python code snippet demonstrating the anonymization of the 'Species' column.
Emphasis on the importance of preserving privacy while working with datasets.

**Slide 3: Batching for Efficiency**
Title: Streamlining Queries with Batching

Introduction to the batching process in data analysis.
Python code snippet showcasing the batch of queries for statistical analysis.
The significance of batching in enhancing efficiency and handling multiple queries in a single pass.
Benefits of using a structured approach to streamline interactions with ChatGPT.

**Slide 4: Insightful Queries and Results**
Title: Unveiling Insights through Conversations

Overview of the queries formulated for statistical analysis of the Iris dataset.
Presentation of sample queries and their corresponding results.
Visual aids such as graphs or charts illustrating insights obtained from ChatGPT responses.
Highlights of key findings and patterns discovered through interactive querying.
    
**Slide 5: Conclusion**
Title: Navigating the ChatGPT Frontier in Data Analysis

Summary of key learnings and insights gained from the exercise.
Recognition of ChatGPT's role in aiding data analysis tasks: from anonymization to query formulation.
Acknowledgment of ChatGPT's capabilities and the need to understand its limitations.
Encouragement to leverage AI responsibly, emphasizing human expertise in critical decision-making.

**Slide 6: Q&A**
Title: Engage in Conversation

Open the floor for questions and discussions.
Encourage participants to share thoughts, experiences, and potential use cases.
Express readiness to address any queries or concerns about the presented material.