<div style="text-align: center;">
  <h1>Large Shipment - Step 1</h1>
  <img src="data/img.png" alt="image">
</div>

In the first step of the project, suppose we have received a large shipment of foreign chocolates, and fortunately, the seller has provided us with a list of detailed specifications for each chocolate. Now, we aim to examine these specifications further and improve the structure for storing this information.

## Dataset

The information about the chocolates is provided in a file named <code>chocolate.csv</code>. First, let's read this information into a dataframe using the cell below.



In [None]:
import numpy as np
import pandas as pd

df = pd.read_csv('./data/chocolate.csv')
df.head()

## Part One

To get a sense of how many chocolates there are and which specifications are recorded, store the dimensions (`shape`) and column names (`columns`) of this dataset in the variables below, then print them:


In [None]:
s = None # To-Do (shape)
cols = None # To-Do (columns)
print(s)
print(cols)

💥 **In Part One, what format should I use to store the dimensions and column names?**

In this part, you can store `s` and `cols` in any structure,
as they will ultimately be converted to NumPy arrays when saved to a file.
However, the standard format for these two variables looks something like the following:

```python
(100, 10)
Index(['Col1', 'Col2', ...], dtype='object')
```

The numbers 100 and 10 are only examples; they stand for the number of rows and columns of the dataframe, respectively.
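
As a small illustration of the two attributes, here is a sketch on a made-up dataframe (the name `toy` and its columns are hypothetical and unrelated to the chocolate dataset):

```python
import pandas as pd

# Hypothetical toy dataframe, not the chocolate dataset.
toy = pd.DataFrame({"Col1": [1, 2, 3], "Col2": ["a", "b", "c"]})

toy_shape = toy.shape    # tuple: (number of rows, number of columns)
toy_cols = toy.columns   # pandas Index holding the column names

print(toy_shape)
print(list(toy_cols))
```

Here `toy_shape` is a plain tuple and `toy_cols` is a pandas `Index`; both convert cleanly to NumPy arrays.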


## Part Two

As you may have noticed, some column names contain the newline character `\n`,
which is printed literally here instead of breaking the line.
To tidy up the dataset, modify the column names so that each `\n` is replaced with a space.
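
One possible approach, sketched on a made-up dataframe, is to use the string methods available on the column index. The column names below are invented for illustration, and `regex=False` is passed so that the pattern is treated as literal text:

```python
import pandas as pd

# Hypothetical column names containing a newline character.
toy = pd.DataFrame({"Maker\nName": ["A"], "Review\nDate": [2016]})

# Replace every newline in the column names with a single space.
toy.columns = toy.columns.str.replace("\n", " ", regex=False)

print(list(toy.columns))
```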

In [None]:
df.columns = None # To-Do

Using the `info()` command, we can view general information about the dataframe and the types of its columns.
We suggest that before proceeding, you use this command to take a look at the dataframe and analyze it a bit.

In [None]:
df.info()

## Part Three

As you may have noticed, all columns containing string values (e.g., the `Company` column) are stored as `object`.
Since there are no missing values (which you will learn about in the next chapter),
there is no need for these columns to remain as `object`.
For ease of calculations, convert these columns to the `string` type, then call `df.info()` to verify that the change was successful.

💡 **Hint 1:**
To convert the type of a column to another type, you can use the `astype` function. For example:

```python
df['col'] = df['col'].astype('int64')
```

💡 **Hint 2:**
Make sure you convert the columns to the pandas `string` type (`astype('string')`), not to Python's `str`,
as using `str` will result in the column type remaining as `object`.
To verify your answer, you can use `df.info()` to check the column types after the change.


💥 **How can I write a condition to detect if a column is of type `object`?**<br>
Suppose the column name is stored in the variable `i`. You can write:

```python
if df.dtypes[i] == 'O':
    ...  # the column named i is stored as object
```
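
Putting the two hints together, a minimal sketch on a made-up dataframe could look like the following. The names `toy`, `name`, and `score` are hypothetical, and the first column is explicitly cast to `object` so the example mirrors the situation described above regardless of the pandas version:

```python
import pandas as pd

# Hypothetical toy dataframe with one text column and one numeric column.
toy = pd.DataFrame({"name": ["a", "b"], "score": [1.5, 2.0]})
toy["name"] = toy["name"].astype(object)  # force the object dtype for the demo

# Convert only the object columns to the pandas string dtype.
for col in toy.columns:
    if toy[col].dtype == "O":
        toy[col] = toy[col].astype("string")

print(toy.dtypes)
```

The numeric column is left untouched because its dtype does not match `'O'`.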

In [None]:
# To-Do
df.info()

## Part Four

The next issue is the column for the percentage of pure cocoa, `Cocoa Percent`.
As you can see, this column is currently stored as a string,
but its values actually represent numerical data.
To better normalize this column, find a way to remove the percentage sign (`%`) from the values
and then convert the column type to `float`.

💬 **Note:**
In this part, you only need to remove the percentage sign (`%`) and ensure the column type is converted to `float`.
For example, the value `63%` should be converted to something like `63.0`.

💡 **Hint:**
To remove the percentage sign, you can use the `str.replace` function.
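
A minimal sketch of this idea on a made-up series (the values and the name `pct` are hypothetical) might look like this:

```python
import pandas as pd

# Hypothetical percentage values stored as text.
pct = pd.Series(["63%", "70%", "72.5%"])

# Strip the literal '%' character, then convert the remaining text to float.
as_float = pct.str.replace("%", "", regex=False).astype(float)

print(as_float.tolist())
```

Passing `regex=False` treats `%` as plain text rather than as a regular expression pattern.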


In [None]:
df['Cocoa Percent'] = None # To-Do

## Analyzing the Frequency of Cocoa Percentage

In many cases, data visualization can quickly reveal important insights. For example, to examine how the cocoa percentage is distributed across the chocolates, we can use a histogram.
 <br>
Don't panic! 😉 You'll soon learn about various types of charts and how to create them in "Data Visualization".

In [None]:
import seaborn as sns
sns.displot(df, x = 'Cocoa Percent', bins = 30)

The chart above shows that a large share of the chocolates have a cocoa content of 70%.
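
An observation like this can also be checked numerically. The sketch below uses made-up values rather than the real dataset, and the names `cocoa`, `counts`, `most_common`, and `share` are hypothetical:

```python
import pandas as pd

# Hypothetical cocoa percentages, already converted to float.
cocoa = pd.Series([70.0, 70.0, 72.0, 65.0, 70.0, 75.0])

counts = cocoa.value_counts()        # how often each value occurs
most_common = counts.idxmax()        # the most frequent value
share = counts.max() / len(cocoa)    # fraction of rows with that value

print(most_common, share)
```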

⚠️ **Be careful:**
Why does running the cell for plotting the chart result in an error?<br>
Please make sure that the `seaborn` library is installed in your working environment.
If it is not, you can install it with the following command:

```bash
pip install seaborn
```

## Data Storage

🎉 Now all the specifications are organized and ready for analysis.
Save the preprocessed dataset to a file named `chocolate_preprocessed.csv`, without the row index (see the `index` parameter), so that it can be used in the next step.
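
To see what omitting the index does, here is a sketch that writes a made-up dataframe to an in-memory buffer instead of a file. The names `toy` and `buffer` and the column names are hypothetical; passing a file path to `to_csv` works the same way:

```python
import io
import pandas as pd

# Hypothetical toy dataframe, not the chocolate dataset.
toy = pd.DataFrame({"name": ["a", "b"], "cocoa": [63.0, 70.0]})

# Write to an in-memory text buffer; index=False leaves out the row labels.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

print(csv_text)
```

Without `index=False`, the output would contain an extra unnamed first column holding the row labels.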


In [None]:
# To-Do