In [None]:
# An introduction to the library and its sources as comments. (10 Points) 
"""
The OpenAI Python library (the `openai` package on PyPI) is the official Python client for the OpenAI API.
It lets Python applications send requests to OpenAI-hosted models and read the replies as Python objects,
so features such as text generation, translation, and question answering can be added to a project
without building or hosting the models.
In short, the library is a bridge: the models run on OpenAI's servers, and the Python code only
builds the requests and parses the responses.
"""

In [None]:
# Advantages, limitations of using the library also provided as comments (10 Points)
"""
<Advantages>
1) Advanced AI capabilities: 
The library gives access to OpenAI's hosted models, such as the GPT-4o model used in the code below, which handle tasks like text generation, translation, and summarization.

2) Easy integration with Python:
OpenAI offers a well-documented Python library, making it straightforward to integrate their models into existing Python applications. 

3) Flexibility and customization:
Python's ecosystem of libraries (for example pandas) makes it easy to preprocess data before it is sent to a model, and request parameters such as `model` and `max_tokens` let each call be adjusted to specific needs.

4) Rapid development:
By utilizing pre-trained OpenAI models, developers can quickly prototype and implement AI features in their applications. 

5) Advanced analytics:
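A model can summarize or interpret data included in the prompt to support decision making, although the amount of data per request is limited by the model's context window, so large datasets must be sampled or aggregated first.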


<Disadvantages>
1) Cost considerations:
API usage is billed per token (input and output), so costs grow with prompt size and request volume and can become expensive for large-scale applications.

2) Lack of transparency:
The inner workings of OpenAI models can be opaque, making it difficult to understand how they generate outputs or identify potential biases. 

3) Ethical concerns:
AI models can perpetuate biases present in training data, requiring careful consideration of ethical implications when using OpenAI. 

4) Potential for misuse:
The ability to generate realistic text can be misused for creating misinformation or harmful content. 

5) Dependence on a remote service:
The models run on OpenAI's servers rather than locally, so applications need network access and are subject to latency, rate limits, and context-length limits, especially at large scale.
"""

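The cost point can be made concrete with a rough estimate from token counts. The sketch below is only an illustration, not official pricing: `estimate_cost` is a hypothetical helper, and the two per-million-token rates are placeholder values that would need to be replaced with current figures from OpenAI's pricing page. The token counts are the ones reported in the `usage` field of the data analysis response later in this notebook.

```python
def estimate_cost(prompt_tokens, completion_tokens,
                  input_rate_per_million, output_rate_per_million):
    """Return the estimated cost of one request, in the currency of the rates."""
    input_cost = prompt_tokens / 1_000_000 * input_rate_per_million
    output_cost = completion_tokens / 1_000_000 * output_rate_per_million
    return input_cost + output_cost


# Placeholder rates per million tokens (assumptions, NOT real prices)
INPUT_RATE = 1.0
OUTPUT_RATE = 2.0

# 877 prompt tokens and 293 completion tokens, as reported by the API
cost = estimate_cost(877, 293, INPUT_RATE, OUTPUT_RATE)
print(f"Estimated cost per request: {cost:.6f}")
print(f"Estimated cost for 10,000 such requests: {cost * 10_000:.2f}")
```

Because the estimate is linear in the token counts, shortening the prompt reduces the cost proportionally.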
In [None]:
#Code to demonstrate calling the library, and use an example dataset to run the library. (30 Points)

In [1]:
# Install the openai package from PyPI (or upgrade it if already installed)
!pip install --upgrade openai



In [65]:
from openai import OpenAI
from sk import my_sk  # the API secret key is stored in a separate sk.py file

In [5]:
# Create the client with the API key (setting OpenAI.api_key on the class is not needed)
client = OpenAI(api_key=my_sk)
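An alternative to importing the key from a local module is to read it from an environment variable. The sketch below assumes the key has been exported as `OPENAI_API_KEY`, which is the variable the `OpenAI()` client looks up by default when no `api_key` argument is given. `load_api_key` is a hypothetical helper name, and its `env` parameter exists only so the function can be tried without touching the real environment.

```python
import os


def load_api_key(env=None, var_name="OPENAI_API_KEY"):
    """Return the API key stored in an environment variable, or raise a clear error."""
    env = os.environ if env is None else env
    key = env.get(var_name, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {var_name} environment variable before creating the client."
        )
    return key


# Demonstration with a fake environment mapping (the key is a dummy value)
demo_key = load_api_key({"OPENAI_API_KEY": "sk-dummy"})
print(demo_key)

# With a real key exported, the client could then be created as:
# client = OpenAI(api_key=load_api_key())
```

Failing early with a readable message avoids a less obvious authentication error on the first request.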

In [67]:
#Example prompt
chat_completion = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": "Say this is a test",
        }
    ],
    model="gpt-4o",   
)

#Print the AI's response
print(chat_completion.choices[0].message.content)
print(chat_completion)

#AI's response
"""
This is a test.
ChatCompletion(id='chatcmpl-AV5MAbxDRNCmGIGHy4gJkdOZaY8rb', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='This is a test.', refusal=None, role='assistant', audio=None, function_call=None, tool_calls=None))], created=1731972122, model='gpt-4o-2024-08-06', object='chat.completion', service_tier=None, system_fingerprint='fp_45cf54deae', usage=CompletionUsage(completion_tokens=5, prompt_tokens=12, total_tokens=17, completion_tokens_details=CompletionTokensDetails(accepted_prediction_tokens=0, audio_tokens=0, reasoning_tokens=0, rejected_prediction_tokens=0), prompt_tokens_details=PromptTokensDetails(audio_tokens=0, cached_tokens=0)))
"""

This is a test.
ChatCompletion(id='chatcmpl-AV7NmruyRx0hNIhZ51x6MZmfHrMqD', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='This is a test.', refusal=None, role='assistant', audio=None, function_call=None, tool_calls=None))], created=1731979910, model='gpt-4o-2024-08-06', object='chat.completion', service_tier=None, system_fingerprint='fp_7e2833e5f9', usage=CompletionUsage(completion_tokens=5, prompt_tokens=12, total_tokens=17, completion_tokens_details=CompletionTokensDetails(accepted_prediction_tokens=0, audio_tokens=0, reasoning_tokens=0, rejected_prediction_tokens=0), prompt_tokens_details=PromptTokensDetails(audio_tokens=0, cached_tokens=0)))
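The printed `ChatCompletion` object is easier to work with when the few fields of interest are pulled out explicitly. The sketch below is a minimal example: `summarize_completion` is a hypothetical helper, and the stand-in object built with `SimpleNamespace` only mimics the attribute layout visible in the output above so that the block runs without an API call. A `finish_reason` of `'length'` means the reply was cut off by the token limit.

```python
from types import SimpleNamespace


def summarize_completion(completion):
    """Collect the reply text, finish reason, and token usage of the first choice."""
    choice = completion.choices[0]
    return {
        "text": choice.message.content,
        "finish_reason": choice.finish_reason,
        "truncated": choice.finish_reason == "length",
        "prompt_tokens": completion.usage.prompt_tokens,
        "completion_tokens": completion.usage.completion_tokens,
        "total_tokens": completion.usage.total_tokens,
    }


# Stand-in for a real response (an assumption for illustration, not an API object)
fake_completion = SimpleNamespace(
    choices=[
        SimpleNamespace(
            finish_reason="stop",
            message=SimpleNamespace(content="This is a test."),
        )
    ],
    usage=SimpleNamespace(prompt_tokens=12, completion_tokens=5, total_tokens=17),
)

summary = summarize_completion(fake_completion)
print(summary)
```

With a real response, `summarize_completion(chat_completion)` would be called the same way.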


In [49]:
# Load the dataset
import pandas as pd 
Diabetes = pd.read_csv("/Users/sunho/diabetes+130-us+hospitals+for+years+1999-2008/diabetic_data.csv")
# Keep only the first 5 rows so the prompt stays small
Diabetes_5 = Diabetes.iloc[:5]
# Render the rows as plain text so they can be embedded in the prompt
Diabetes_string = Diabetes_5.to_string(index=False)
# Diabetes.head()
# Diabetes.columns
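Rendering every column with `to_string` makes the prompt long: the request in the next cell reports 877 prompt tokens for five rows. One way to shrink it, sketched here with the standard-library `csv` module, is to send only the columns the question needs. `compact_table` is a hypothetical helper, and the sample rows are made up for illustration; they are not taken from `diabetic_data.csv`.

```python
import csv
import io


def compact_table(csv_text, columns, max_rows=5):
    """Return a small comma-separated table holding only the requested columns."""
    reader = csv.DictReader(io.StringIO(csv_text))
    lines = [",".join(columns)]  # header row
    for i, row in enumerate(reader):
        if i >= max_rows:
            break
        lines.append(",".join(row[col] for col in columns))
    return "\n".join(lines)


# Made-up sample data (an assumption for illustration only)
sample_csv = (
    "encounter_id,race,gender,diag_1\n"
    "1,Caucasian,Female,250\n"
    "2,AfricanAmerican,Male,276\n"
    "3,Caucasian,Male,648\n"
)

table = compact_table(sample_csv, ["race", "gender"], max_rows=2)
print(table)
```

The same effect can be had in pandas by selecting the columns before calling `to_string`.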

In [61]:
# Data analysis prompt
prompt = f"""
            I have a dataset with diabetes: {Diabetes_string} In the columns it has  'diag_1','diag_2', 'diag_3'. 
            Do a groupby count with 'gender' and 'race'.
         """
chat_completion = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": prompt 
        }
    ],
    model="gpt-4o", 
    max_tokens=2000
)

#Print the AI's response
print(chat_completion.choices[0].message.content)
print(chat_completion)

# AI's response
"""
To obtain a groupby count of the `diag_1`, `diag_2`, and `diag_3` columns based on `gender` and `race`, you can use the Pandas library in Python. You would group by the specified columns and aggregate the count of diagnoses. Here's how you can do it:

Assuming you have your data loaded into a Pandas DataFrame named `df`, you can use the following code:

```python
import pandas as pd

# Assuming df is your DataFrame
# Replace 'your_data.csv' with the path to your dataset file if necessary
# df = pd.read_csv('your_data.csv')

# Group by 'gender' and 'race' and count the occurrences of 'diag_1', 'diag_2', and 'diag_3'
grouped = df.groupby(['gender', 'race'])[['diag_1', 'diag_2', 'diag_3']].count()

# Display the resulting grouped DataFrame
print(grouped)
```

This code will produce a DataFrame where each combination of `gender` and `race` will display the count of non-null entries for `diag_1`, `diag_2`, and `diag_3`. 

The `count()` function automatically excludes `NaN` values, meaning it will count only the entries that contain actual diagnostic codes. The result will list different genders and races along with counts of each diagnosis column.
ChatCompletion(id='chatcmpl-AV78fkzJBITeyFqjvCbU1GGGN4Ewv', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content="To obtain a groupby count of the `diag_1`, `diag_2`, and `diag_3` columns based on `gender` and `race`, you can use the Pandas library in Python. You would group by the specified columns and aggregate the count of diagnoses. Here's how you can do it:\n\nAssuming you have your data loaded into a Pandas DataFrame named `df`, you can use the following code:\n\n```python\nimport pandas as pd\n\n# Assuming df is your DataFrame\n# Replace 'your_data.csv' with the path to your dataset file if necessary\n# df = pd.read_csv('your_data.csv')\n\n# Group by 'gender' and 'race' and count the occurrences of 'diag_1', 'diag_2', and 'diag_3'\ngrouped = df.groupby(['gender', 'race'])[['diag_1', 'diag_2', 'diag_3']].count()\n\n# Display the resulting grouped DataFrame\nprint(grouped)\n```\n\nThis code will produce a DataFrame where each combination of `gender` and `race` will display the count of non-null entries for `diag_1`, `diag_2`, and `diag_3`. \n\nThe `count()` function automatically excludes `NaN` values, meaning it will count only the entries that contain actual diagnostic codes. The result will list different genders and races along with counts of each diagnosis column.", refusal=None, role='assistant', audio=None, function_call=None, tool_calls=None))], created=1731978973, model='gpt-4o-2024-08-06', object='chat.completion', service_tier=None, system_fingerprint='fp_45cf54deae', usage=CompletionUsage(completion_tokens=293, prompt_tokens=877, total_tokens=1170, completion_tokens_details=CompletionTokensDetails(accepted_prediction_tokens=0, audio_tokens=0, reasoning_tokens=0, rejected_prediction_tokens=0), prompt_tokens_details=PromptTokensDetails(audio_tokens=0, cached_tokens=0)))
"""

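Since the reply describes the pandas code rather than its result, the count itself has to be computed locally. As a minimal sketch of the same idea without pandas, the block below counts rows per (`gender`, `race`) pair with `collections.Counter`. The helper name `count_by` and the sample records are made up for illustration and are not rows from the dataset.

```python
from collections import Counter


def count_by(records, keys):
    """Count records per unique combination of the given keys."""
    return Counter(tuple(record[k] for k in keys) for record in records)


# Made-up sample records (an assumption for illustration only)
sample_records = [
    {"gender": "Female", "race": "Caucasian"},
    {"gender": "Female", "race": "Caucasian"},
    {"gender": "Male", "race": "AfricanAmerican"},
]

counts = count_by(sample_records, ["gender", "race"])
for (gender, race), n in sorted(counts.items()):
    print(gender, race, n)
```

On the real DataFrame, the pandas call suggested in the reply plays the same role, and the small table of counts could then be placed in a follow-up prompt if the model is asked to interpret it.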

In [63]:
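# Asked the model to run the groupby itself. A Chat Completions call only generates text and
# cannot execute code, so the reply contains instructions instead of a computed result.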
prompt = f"""
            I have a dataset with diabetes: {Diabetes_string} In the columns it has  'diag_1','diag_2', 'diag_3'. 
            Do a groupby count with 'gender' and 'race'. Run the code yourself and show me the result.
         """
chat_completion = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": prompt 
        }
    ],
    model="gpt-4o", 
    max_tokens=2000
)

#Print the AI's response
print(chat_completion.choices[0].message.content)
print(chat_completion)


#AI's response
"""
I currently cannot execute real-time code, but I can certainly guide you on how to perform a groupby count in Python using the Pandas library.

Here's how you can perform the groupby operation and get the count based on 'gender' and 'race':

```python
import pandas as pd

# Assuming `df` is your DataFrame
grouped = df.groupby(['gender', 'race']).size().reset_index(name='count')
print(grouped)
```

This code snippet will group your DataFrame by 'gender' and 'race' and then calculate the count of occurrences for each combination of 'gender' and 'race', displaying the result.

As I cannot run this code, when you execute it on your own machine, you should see an output with the count of records for each unique combination of 'gender' and 'race' in your dataset.
ChatCompletion(id='chatcmpl-AV7GZE3uIBp9QVlYadyIB07urrVJn', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content="I currently cannot execute real-time code, but I can certainly guide you on how to perform a groupby count in Python using the Pandas library.\n\nHere's how you can perform the groupby operation and get the count based on 'gender' and 'race':\n\n```python\nimport pandas as pd\n\n# Assuming `df` is your DataFrame\ngrouped = df.groupby(['gender', 'race']).size().reset_index(name='count')\nprint(grouped)\n```\n\nThis code snippet will group your DataFrame by 'gender' and 'race' and then calculate the count of occurrences for each combination of 'gender' and 'race', displaying the result.\n\nAs I cannot run this code, when you execute it on your own machine, you should see an output with the count of records for each unique combination of 'gender' and 'race' in your dataset.", refusal=None, role='assistant', audio=None, function_call=None, tool_calls=None))], created=1731979463, model='gpt-4o-2024-08-06', object='chat.completion', service_tier=None, system_fingerprint='fp_45cf54deae', usage=CompletionUsage(completion_tokens=175, prompt_tokens=888, total_tokens=1063, completion_tokens_details=CompletionTokensDetails(accepted_prediction_tokens=0, audio_tokens=0, reasoning_tokens=0, rejected_prediction_tokens=0), prompt_tokens_details=PromptTokensDetails(audio_tokens=0, cached_tokens=0)))
"""

I currently cannot execute real-time code, but I can certainly guide you on how to perform a groupby count in Python using the Pandas library.

Here's how you can perform the groupby operation and get the count based on 'gender' and 'race':

```python
import pandas as pd

# Assuming `df` is your DataFrame
grouped = df.groupby(['gender', 'race']).size().reset_index(name='count')
print(grouped)
```

This code snippet will group your DataFrame by 'gender' and 'race' and then calculate the count of occurrences for each combination of 'gender' and 'race', displaying the result.

As I cannot run this code, when you execute it on your own machine, you should see an output with the count of records for each unique combination of 'gender' and 'race' in your dataset.
ChatCompletion(id='chatcmpl-AV7GZE3uIBp9QVlYadyIB07urrVJn', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content="I currently cannot execute real-time code, but I can certainly guide yo