## Introduction

This workspace is just a way to share instructions and datasets with you. You'll be using [ChatGPT](https://chat.openai.com/) and its [Advanced Data Analysis](https://help.openai.com/en/articles/8437071-advanced-data-analysis-chatgpt-enterprise-version) tool for the data analysis.

Click through to ChatGPT and log in or sign up. You'll need one of the premium plans: either ChatGPT Plus or ChatGPT Enterprise.

## Enabling the Advanced Data Analysis tool

### Open the 'Settings & Beta' menu

<img src="images/chapgpt-settings-beta-context-menu.png" width="200px" />

- In the bottom left, click the ellipsis next to your name
- Click 'Settings & Beta'

### Turn ADA on

<img src="images/chatgpt-settings-beta-features-ada.png" width="400px" />

- Select the 'Beta features' pane
- Toggle 'Advanced data analysis' to _on_

### Enable ADA in a new GPT-4 chat

<img src="images/chatgpt-plugins-advanced-data-interpreter.png" width="200px" />

- Start a new chat
- Select the GPT-4 model
- Hover over GPT-4 and in the dropdown menu select 'Advanced Data Analysis'



## Technical Advice for Broken ChatGPT Sessions

OpenAI seems to end chat sessions quickly if you leave them idle. Once that happens, your Python session is lost, and ChatGPT can get confused when it tries to run things again.

If you get a message like 

> It seems that there was a technical issue while processing the request. 

or

> It seems that we're experiencing technical difficulties with the calculations.

then it is best to start a new chat and run through the steps again.

## A Simple Data Analytics Workflow: Popular Media Franchises

We'll start by analyzing a dataset about the highest grossing media franchises in the world. (You can also see this dataset in the [MySQL Basics cheat sheet](https://www.datacamp.com/cheat-sheet/my-sql-basics-cheat-sheet).)

This simple flow includes three steps:

- Importing a dataset from an Excel file.
- Asking some questions about the dataset.
- Creating a modified dataset for download.

### Instructions

- Download the 'Highest Grossing Media Franchises.xlsx' spreadsheet.
- Add this file to a prompt.  
    <img src="images/chatgpt-add-file-plus-button.png" width="150px" />
- In the same prompt, ask GPT to import the dataset.

```
Analyze the dataset in the 'Highest Grossing Media Franchises.xlsx' spreadsheet.  
- The dataset is contained in the 'data' worksheet. 
- A description of the dataset is included in the 'data dictionary' worksheet. In this sheet, the tabular data starts on the third row. 

To begin, import both worksheets and display their contents.
```

- Give GPT some questions to answer.

```
Answer the following questions about the dataset:   

- Which franchise has the highest total revenue?  
- Which company owns the most franchises, and how many do they own?  
- What is the mean total revenue grouped by medium?
```
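
Behind the scenes, ChatGPT answers questions like these by writing and running Python. Here's a minimal sketch of that kind of code, using only the standard library (ChatGPT will probably reach for pandas instead). The column names and the four sample rows are placeholders made up for illustration, not values from the real spreadsheet.

```python
from collections import Counter, defaultdict
from statistics import mean

# Placeholder rows: names, owners, and revenues are invented for illustration.
rows = [
    {"franchise": "Franchise A", "owner": "Owner X", "original_medium": "Video game", "total_revenue": 90.0},
    {"franchise": "Franchise B", "owner": "Owner Y", "original_medium": "Film", "total_revenue": 70.0},
    {"franchise": "Franchise C", "owner": "Owner Y", "original_medium": "Film", "total_revenue": 50.0},
    {"franchise": "Franchise D", "owner": "Owner Z", "original_medium": "Book", "total_revenue": 30.0},
]


def top_franchise(rows):
    """Return the row with the highest total revenue."""
    return max(rows, key=lambda r: r["total_revenue"])


def top_owner(rows):
    """Return (owner, count) for the owner with the most franchises."""
    return Counter(r["owner"] for r in rows).most_common(1)[0]


def mean_revenue_by_medium(rows):
    """Return {medium: mean total revenue} across all rows."""
    groups = defaultdict(list)
    for r in rows:
        groups[r["original_medium"]].append(r["total_revenue"])
    return {medium: mean(values) for medium, values in groups.items()}


print(top_franchise(rows)["franchise"])
print(top_owner(rows))
print(mean_revenue_by_medium(rows))
```

Comparing ChatGPT's answers against a quick check like this is a handy way to confirm it read the columns the way you intended.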

- Get GPT to create a file for you to download.

```
Generate a CSV file containing a subset of the media franchise dataset. 

Only include rows where the owner is 'The Walt Disney Company'.

Call the file 'walt-disney-franchises.csv'
```
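
If you'd like to reproduce the subset yourself, the filtering step can be sketched roughly as below. The function name `write_owner_subset`, the `owner` column name, and the sample rows are assumptions for illustration; the real spreadsheet may use different column names.

```python
import csv
import io


def write_owner_subset(rows, owner, out_file):
    """Write the rows whose 'owner' matches to out_file as CSV; return the row count."""
    subset = [r for r in rows if r["owner"] == owner]
    fieldnames = list(rows[0].keys()) if rows else []
    writer = csv.DictWriter(out_file, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(subset)
    return len(subset)


# Placeholder rows, invented for illustration.
sample = [
    {"franchise": "Franchise A", "owner": "The Walt Disney Company"},
    {"franchise": "Franchise B", "owner": "Owner Y"},
    {"franchise": "Franchise C", "owner": "The Walt Disney Company"},
]

# An in-memory buffer stands in for the output file here.
buffer = io.StringIO()
n_written = write_owner_subset(sample, "The Walt Disney Company", buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write a real file instead, pass `open('walt-disney-franchises.csv', 'w', newline='')` in place of the buffer.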

One limitation of ChatGPT is that it can't display interactive plots, like those generated by Plotly. You have two choices. 

1. Stick to a static plotting tool, like Seaborn.
2. Generate Python code to draw a plot that you can copy and paste into your favorite data analysis tool (obviously DataCamp Workspace).

### Instructions

- Ask ChatGPT to generate Python code to draw a plot.

```
Draw a Plotly Express bar plot of the total revenue by franchise with bars colored by original medium.
```

- This will fail because ChatGPT can't display the interactive plot, so try again using Seaborn.

```
Try again. This time, use the Seaborn package to draw the plot.
```

## A More Sophisticated Workflow: American Football Results

Let's try something slightly more complicated: getting data from a webpage.

One of the definitive sources of data for American Football is pro-football-reference.com. The datasets it contains are amazing ... but there is no API and the website seems pathologically designed to make web scraping hard.

We'll scrape the current week's results, which have an individual table for each game, with a fairly novel structure.

<img src="images/pff-box-scores.png" />

That provides a great test of whether ChatGPT can save you a painful web scraping experience.

### Instructions

- Go to https://www.pro-football-reference.com/boxscores/
- Save the web page as HTML.

Backup plan: There's a recent copy of the page in this workspace as `Latest NFL Scores Pro-Football-Reference.com.html`.

- Start a new chat, making sure ADA is enabled.
- Add the HTML page to the prompt.
- Ask GPT to scrape the data for this week's games from the page. Firefox and Chrome give slightly different file names when saving; you may need to adjust this name in the prompt.

```
Import and scrape the webpage 'Latest NFL Scores Pro-Football-Reference.com.html'.

In the section '2023 Week 8' you will find several tables. Each table contains data about one American football game that was played this week.

From each table, check the game status. If the status is "Preview" ignore that table.

If the game status is "Final", then extract the following data and create a data frame with one row, and columns as specified.

- `game_date`: the date that the game took place
- `winning_team`: the name of the winning team
- `winning_score`: the score of the winning team
- `losing_team`: the name of the losing team
- `losing_score`: the score of the losing team
- `pass_yds_player`: the name of the player with the most passing yards
- `pass_yds_value`: the value of the most passing yards
- `rush_yds_player`: the name of the player with the most rushing yards
- `rush_yds_value`: the value of the most rushing yards
- `rec_yds_player`: the name of the player with the most receiving yards
- `rec_yds_value`: the value of the most receiving yards

Vertically concatenate the rows from each of these data frames into a single data frame.
```
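
To get a feel for what ChatGPT has to write here, this is a rough standard-library sketch of the "read each table, skip previews, keep finals" logic. It runs against a tiny made-up HTML snippet, not the real page: the actual pro-football-reference markup is more involved, and the cell layout assumed here (team, score, status) is a simplification. The names `TableCollector` and `final_games` are hypothetical, and the player statistics columns are left out.

```python
from html.parser import HTMLParser


class TableCollector(HTMLParser):
    """Collect every <table> as a list of rows of cell text (nested tables are not handled)."""

    def __init__(self):
        super().__init__()
        self.tables = []   # one list of rows per table seen so far
        self._row = None   # cells of the row being read, if any
        self._cell = None  # text fragments of the cell being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse runs of whitespace inside the cell.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def final_games(html):
    """Return one dict per table that is marked 'Final' and has two scored team rows."""
    parser = TableCollector()
    parser.feed(html)
    games = []
    for table in parser.tables:
        # Assumed layout: each team row is [team, score, ...].
        team_rows = [row for row in table if len(row) >= 2 and row[1].isdigit()]
        is_final = any("Final" in cell for row in table for cell in row)
        if not is_final or len(team_rows) != 2:
            continue  # skips "Preview" tables, which have no scores yet
        winner, loser = sorted(team_rows, key=lambda row: int(row[1]), reverse=True)
        games.append({
            "winning_team": winner[0],
            "winning_score": int(winner[1]),
            "losing_team": loser[0],
            "losing_score": int(loser[1]),
        })
    return games


# Made-up markup for illustration only.
sample_html = """
<table><tr><td>Team A</td><td>24</td><td>Final</td></tr>
       <tr><td>Team B</td><td>31</td><td></td></tr></table>
<table><tr><td>Team C</td><td></td><td>Preview</td></tr>
       <tr><td>Team D</td><td></td><td></td></tr></table>
"""

print(final_games(sample_html))
```

Note that this sketch infers the winner from the scores, so a tied game would need extra handling.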

- Ask GPT to perform some data analysis.

```
Answer the following questions about the dataset

- Which team won by the largest point difference?
- Which player had the most receiving yards in a game this week? What is the three-letter abbreviation for his team name?
- What was the average score of the losing teams this week?
```
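
Once the results are in one table, these questions boil down to a few lines of Python. Here's a hedged sketch using the column names from the scraping prompt; the two sample games, teams, and players are invented placeholders, and the helper names are hypothetical.

```python
from statistics import mean

# Placeholder results, invented for illustration.
games = [
    {"winning_team": "Team A", "winning_score": 31, "losing_team": "Team B", "losing_score": 24,
     "rec_yds_player": "Player P-AAA", "rec_yds_value": 120},
    {"winning_team": "Team C", "winning_score": 20, "losing_team": "Team D", "losing_score": 3,
     "rec_yds_player": "Player Q-CCC", "rec_yds_value": 95},
]


def largest_margin(games):
    """Return (winning_team, margin) for the game with the largest point difference."""
    best = max(games, key=lambda g: g["winning_score"] - g["losing_score"])
    return best["winning_team"], best["winning_score"] - best["losing_score"]


def top_receiver(games):
    """Return (player, yards) for the most receiving yards in any single game."""
    best = max(games, key=lambda g: g["rec_yds_value"])
    return best["rec_yds_player"], best["rec_yds_value"]


def mean_losing_score(games):
    """Return the average score of the losing teams."""
    return mean(g["losing_score"] for g in games)


print(largest_margin(games))
print(top_receiver(games))
print(mean_losing_score(games))
```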

**PLAY TIME**

- Think of some more questions to ask about the dataset, and ask GPT.


Data cleaning is often tedious, so it's a great candidate for outsourcing to AI.

In this dataset, one problem that occurred for me is that the winning team and losing team columns contained extra content after a newline.

<img src="images/winning-team-needs-cleaning.png" width="200" />

This problem may not happen every time you use ChatGPT for this task, since there's an element of randomness, but it's worth solving if it does occur.

### Instructions

- Ask GPT to clean up the winning and losing team contents.

```
Modify the dataset, saving the result into a new dataframe.

In the winning team column, strip any characters from the newline onwards. The team name should contain only alphabetic characters, numeric digits, and spaces. 

Do the same for the losing team column.
```
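
The cleaning rule in that prompt is simple enough to sketch by hand. This is one possible version, assuming the team name always comes before the first newline; the function name `clean_team_name` and the sample strings are made up for illustration.

```python
import re


def clean_team_name(raw):
    """Keep the text before the first newline, then drop anything
    that isn't a letter, a digit, or a space."""
    first_line = raw.split("\n", 1)[0]
    return re.sub(r"[^A-Za-z0-9 ]", "", first_line).strip()


print(clean_team_name("Team A\nFinal"))
print(clean_team_name("Team C"))
```

In pandas, the same idea would be applied to the whole column at once rather than one value at a time.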

Another issue is that the columns for the players with the most passing/rushing/receiving yards actually contain the player surname and a three-letter abbreviation for the team name. Ideally, we'd have the surname in one column and the complete team name in another column.

### Instructions

- Split the passing yds player column into two.

```
Modify the dataset, saving the result into a new dataframe. The "Pass Yds Player" column currently contains the name of the player with the most passing yards and also a three letter code for his team, separated by a hyphen. Split this column into two, as follows:

- `pass_yds_player_surname`: The first part, containing the surname
- `pass_yds_player_team_code`: The second part, containing the three letter code for his team.
```

- Try to replace the three-letter codes with the full team name.

```
Replace the three letter code in `pass_yds_player_team_code` with the full team name. You can find this in either the winning team column or the losing team column.
```

You will likely find that ChatGPT fails to accomplish this task since writing Python code to guess the full name from the abbreviation is hard. 

Fortunately, there's another way to solve this: ChatGPT knows how NFL team names and their abbreviations match up, because it's been trained on a vast amount of internet data where these abbreviations are used (for example, [this dataset on GitHub](https://gist.github.com/cnizzardini/13d0a072adb35a0d5817)).

So we can use GPT's knowledge to create a lookup dataset.

### Instructions

- Ask GPT to create a lookup dataset using its knowledge of the NFL.

```
Try creating a dataset that matches NFL team names with their standard three letter abbreviation. Use your knowledge of the NFL. 

If you don't know the answer, don't make values up, just leave them missing. 

Provide the results as a two column table.
```

- Ask GPT to use the new lookup dataset to get full team names.

```
Well done! Store this table as a Pandas dataframe, then join the game results dataframe to this lookup dataframe to help solve the previous problem of getting the team name for the player with the most passing yards.
```
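
For reference, the split-then-lookup approach can be sketched as follows. This is an illustration, not the code ChatGPT will necessarily produce: the helper names are hypothetical, and the team codes and names in the lookup are placeholders rather than real NFL abbreviations.

```python
def split_player(value):
    """Split 'Surname-ABC' into (surname, team_code) at the last hyphen,
    so double-barrelled surnames stay intact."""
    if "-" not in value:
        return value.strip(), None
    surname, code = value.rsplit("-", 1)
    return surname.strip(), code.strip()


def add_team_names(games, lookup, column="pass_yds_player"):
    """Return new rows with surname, team code, and full team name added."""
    enriched = []
    for game in games:
        surname, code = split_player(game[column])
        enriched.append({
            **game,
            f"{column}_surname": surname,
            f"{column}_team_code": code,
            # Unknown codes stay missing (None) rather than being guessed.
            f"{column}_team_name": lookup.get(code),
        })
    return enriched


# Placeholder lookup and results, invented for illustration.
team_lookup = {"AAA": "Team A"}
results = [{"pass_yds_player": "Smith-AAA"}, {"pass_yds_player": "Lee-Jones-ZZZ"}]

enriched = add_team_names(results, team_lookup)
print(enriched)
```

Leaving unmatched codes as missing mirrors the "don't make values up" instruction in the lookup prompt. With pandas dataframes, the equivalent step is a left join of the results onto the lookup table.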

## Summary

You've seen how to use ChatGPT's Advanced Data Analysis tool for data analysis.

- You can provide data files to it to work on.
- You can ask it natural language questions about your dataset, and it runs Python code to answer them.
- It can't display interactive plots, but it can draw static plots or generate Python code for a plot that you run elsewhere.
- If generating Python code is hard, you can often try a different approach, because ChatGPT is good at solving many problems without code.

## Bonus task

You've done well! Time to celebrate with some AI generated art.

You can enable DALL·E 3, OpenAI's image generation AI, in the same way you enabled Advanced Data Analysis.

<img src="images/chatgpt-plugins-dalle3.png" width=200 />

Here's some guidance on prompting for image generation, taken from [Adel's prompt-along](https://www.datacamp.com/code-along/chat-gpt-prompt-engineering-for-beginners).

<img src="images/image-prompt-guidance.png" width=400 />

### Instructions

- Ask ChatGPT to draw a picture of data analysts celebrating. Be imaginative with your prompt!

## Keep learning!

- Learn how to make use of AI in the workplace with the [AI Business Fundamentals](https://www.datacamp.com/tracks/ai-business-fundamentals) skill track.
- Get ChatGPT prompt ideas with the [ChatGPT for Data Science](https://www.datacamp.com/cheat-sheet/chatgpt-cheat-sheet-data-science) and [ChatGPT for Business](https://www.datacamp.com/cheat-sheet/business-use-cases-of-chatgpt) cheat sheets.
- Need learning inspiration? Read this tutorial on [How to Learn AI From Scratch](https://www.datacamp.com/blog/how-to-learn-ai).