# <center>Preprocessing Techniques in Python: A Basic Guide</center>

## Introduction to Data Preprocessing

Data preprocessing is a crucial step in the data analysis pipeline, as it involves cleaning, transforming, and organizing raw data to make it suitable for further analysis. This process helps improve the quality and reliability of data, enabling accurate insights and predictions. In this IPython Notebook (ipynb), we will explore some fundamental preprocessing techniques using Python and popular libraries such as pandas. 

The notebook will cover various tasks, including importing a dataset, examining its structure and contents, manipulating data using pandas functions, and performing basic statistical analysis. By following this guide, you will gain a solid understanding of the essential preprocessing steps and how to apply them to your own datasets. 

Now let's dive into the specific questions and tasks covered in this notebook. 

## Questions and Descriptions

### Import the dataset: 

- To begin with, we will import the dataset into our notebook. This involves reading the data from a file (e.g., CSV, Excel) or fetching it from an external source (e.g., an API). We will use the appropriate pandas functions or other libraries to accomplish this task. 

In [2]:
import pandas as pd 

In [4]:
# Import the Dataset
data = pd.read_csv('candy.csv')
data

Unnamed: 0,id,competitorname,chocolate,fruity,caramel,peanutyalmondy,nougat,crispedricewafer,hard,bar,pluribus,sugarpercent,pricepercent,winpercent
0,0,100 Grand,Yes,No,Yes,No,No,Yes,No,Yes,No,0.732,0.860,66.971725
1,1,3 Musketeers,Yes,No,No,No,Yes,No,No,Yes,No,0.604,0.511,67.602936
2,2,Air Heads,No,Yes,No,No,No,No,No,No,No,0.906,0.511,52.341465
3,3,Almond Joy,Yes,No,No,Yes,No,No,No,Yes,No,0.465,0.767,50.347546
4,4,Baby Ruth,Yes,No,Yes,Yes,Yes,No,No,Yes,No,0.604,0.767,56.914547
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
78,78,Twizzlers,No,Yes,No,No,No,No,No,No,No,0.220,0.116,45.466282
79,79,Warheads,No,Yes,No,No,No,No,Yes,No,No,0.093,0.116,39.011898
80,80,Welch's Fruit Snacks,No,Yes,No,No,No,No,No,No,Yes,0.313,0.313,44.375519
81,81,Werther's Original Caramel,No,No,Yes,No,No,No,Yes,No,No,0.186,0.267,41.904308
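
The cell above assumes that `candy.csv` sits next to the notebook. As a minimal sketch that runs without that file, the snippet below reads the same kind of CSV text from an in-memory buffer. The names `csv_text` and `sample` are illustrative, and the three rows are copied from the output above.

```python
import io

import pandas as pd

# Three rows copied from the output above; the notebook itself reads candy.csv.
csv_text = """id,competitorname,chocolate,sugarpercent,winpercent
0,100 Grand,Yes,0.732,66.971725
1,3 Musketeers,Yes,0.604,67.602936
2,Air Heads,No,0.906,52.341465
"""

# read_csv accepts a file path or any file-like object.
sample = pd.read_csv(io.StringIO(csv_text))

# shape is (rows, columns): a quick sanity check after any import.
print(sample.shape)
```

Checking `shape` right after loading is a cheap way to confirm that the number of rows and columns matches what we expect from the file.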


### Display the head of the dataset

- After importing the dataset, we will display the first few rows to get a glimpse of its structure and contents. This step helps us to understand the variables and their initial values, providing an overview of the dataset. 

In [12]:
# Display the head of the dataset
data.head()

Unnamed: 0,id,competitorname,chocolate,fruity,caramel,peanutyalmondy,nougat,crispedricewafer,hard,bar,pluribus,sugarpercent,pricepercent,winpercent
0,0,100 Grand,Yes,No,Yes,No,No,Yes,No,Yes,No,0.732,0.86,66.971725
1,1,3 Musketeers,Yes,No,No,No,Yes,No,No,Yes,No,0.604,0.511,67.602936
2,2,Air Heads,No,Yes,No,No,No,No,No,No,No,0.906,0.511,52.341465
3,3,Almond Joy,Yes,No,No,Yes,No,No,No,Yes,No,0.465,0.767,50.347546
4,4,Baby Ruth,Yes,No,Yes,Yes,Yes,No,No,Yes,No,0.604,0.767,56.914547


### Display the tail of the dataset

- Similar to displaying the head, we will also display the last few rows of the dataset. This allows us to check if there are any patterns or trends at the end of the data, which might be useful for analysis or preprocessing decisions. 

In [13]:
# Display the Tail of the Dataset
data.tail()

Unnamed: 0,id,competitorname,chocolate,fruity,caramel,peanutyalmondy,nougat,crispedricewafer,hard,bar,pluribus,sugarpercent,pricepercent,winpercent
78,78,Twizzlers,No,Yes,No,No,No,No,No,No,No,0.22,0.116,45.466282
79,79,Warheads,No,Yes,No,No,No,No,Yes,No,No,0.093,0.116,39.011898
80,80,Welch's Fruit Snacks,No,Yes,No,No,No,No,No,No,Yes,0.313,0.313,44.375519
81,81,Werther's Original Caramel,No,No,Yes,No,No,No,Yes,No,No,0.186,0.267,41.904308
82,82,Whoppers,Yes,No,No,No,No,Yes,No,No,Yes,0.872,0.848,49.524113
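
Both `head()` and `tail()` return five rows by default and accept an optional row count. The sketch below uses a small hand-built `sample` frame (an assumption for illustration, with values copied from the outputs above) rather than the full dataset.

```python
import pandas as pd

# Small illustrative frame; names and values are copied from the outputs above.
sample = pd.DataFrame({
    "competitorname": ["100 Grand", "3 Musketeers", "Air Heads", "Almond Joy",
                       "Baby Ruth", "Twizzlers", "Warheads"],
    "winpercent": [66.971725, 67.602936, 52.341465, 50.347546,
                   56.914547, 45.466282, 39.011898],
})

first_two = sample.head(2)   # first 2 rows instead of the default 5
last_three = sample.tail(3)  # last 3 rows instead of the default 5

print(first_two)
print(last_three)
```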


### Display the Column names

- In this task, we will extract and display the column names of the dataset. By knowing the column names, we can refer to specific variables during data manipulation and analysis. 

In [15]:
# Display the column names
for col in data.columns:
    print(col)

id
competitorname
chocolate
fruity
caramel
peanutyalmondy
nougat
crispedricewafer
hard
bar
pluribus
sugarpercent
pricepercent
winpercent
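
Besides looping over `data.columns`, the column labels can be turned into a plain list or tested for membership. This is a minimal sketch on an empty, hand-built `sample` frame that only borrows a few column names from the dataset.

```python
import pandas as pd

# Empty illustrative frame that reuses four of the dataset's column names.
sample = pd.DataFrame(columns=["id", "competitorname", "chocolate", "fruity"])

column_names = sample.columns.tolist()   # plain Python list of labels
has_fruity = "fruity" in sample.columns  # membership test on the column Index

print(column_names)
print(has_fruity)
```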


### Display the datatypes of the columns

- Understanding the datatypes of the columns is essential for performing accurate calculations and transformations and for handling missing values effectively. We will retrieve and display the datatype of each column in the dataset. 

In [7]:
# Display the Datatypes of the columns
# datatypes = data.dtypes
# print(datatypes)
data.dtypes

id                    int64
competitorname       object
chocolate            object
fruity               object
caramel              object
peanutyalmondy       object
nougat               object
crispedricewafer     object
hard                 object
bar                  object
pluribus             object
sugarpercent        float64
pricepercent        float64
winpercent          float64
dtype: object
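
The output above shows that the Yes/No columns are stored as `object` (text). One possible follow-up, sketched here on a hand-built `sample` frame and not part of the original notebook, is to map those strings to booleans and to pick out the numeric columns with `select_dtypes`.

```python
import pandas as pd

# Illustrative frame; the two rows are copied from the outputs above.
sample = pd.DataFrame({
    "competitorname": ["100 Grand", "Air Heads"],
    "chocolate": ["Yes", "No"],
    "sugarpercent": [0.732, 0.906],
})

# Map the Yes/No strings to booleans so the column can be used as a mask.
sample["chocolate"] = sample["chocolate"].map({"Yes": True, "No": False})

# select_dtypes picks columns by dtype; "number" excludes text and booleans.
numeric_cols = sample.select_dtypes(include="number").columns.tolist()

print(sample.dtypes)
print(numeric_cols)
```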

### Display the statistical Information about the columns. 

- Statistical information provides insights into the distribution, central tendency, and spread of the data. We will compute and display various statistical measures such as mean, standard deviation, minimum, maximum, quartiles, etc., for each numerical column in the dataset. 

In [11]:
# Display statistical information about suitable columns.
data.describe()
# data.info()

Unnamed: 0,id,sugarpercent,pricepercent,winpercent
count,83.0,83.0,83.0,83.0
mean,41.0,0.489916,0.472627,50.584908
std,24.103942,0.276498,0.286503,14.74888
min,0.0,0.034,0.011,22.445341
25%,20.5,0.267,0.261,39.16328
50%,41.0,0.465,0.465,48.982651
75%,61.5,0.732,0.703,60.332349
max,82.0,0.988,0.976,84.18029
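
By default `describe()` covers only the numeric columns, which is why the Yes/No columns are absent from the output above. As a hedged sketch on a hand-built `sample` frame, `include="all"` adds a summary of the text columns, and `value_counts()` gives the frequency of each category.

```python
import pandas as pd

# Illustrative frame; the five rows are copied from the head() output above.
sample = pd.DataFrame({
    "chocolate": ["Yes", "Yes", "No", "Yes", "Yes"],
    "winpercent": [66.971725, 67.602936, 52.341465, 50.347546, 56.914547],
})

numeric_summary = sample.describe()           # numeric columns only
full_summary = sample.describe(include="all") # adds unique/top/freq for text
counts = sample["chocolate"].value_counts()   # frequency of each category

print(full_summary)
print(counts)
```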


### Uses of the .loc function in columns. 

- The .loc indexer in pandas allows us to access and manipulate specific rows and columns based on their labels. We will explore the different ways to use .loc specifically for columns, which can be helpful for data extraction or transformation tasks. 

In [21]:
# Display all the rows of columns Chocolate, Caramel, Fruity
data.loc[:,['chocolate', 'caramel', 'fruity']]

Unnamed: 0,chocolate,caramel,fruity
0,Yes,Yes,No
1,Yes,No,No
2,No,No,Yes
3,Yes,No,No
4,Yes,Yes,No
...,...,...,...
78,No,No,Yes
79,No,No,Yes
80,No,No,Yes
81,No,Yes,No
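
`.loc` accepts either a list of column labels, as in the cell above, or a label slice. The following sketch uses a small hand-built `sample` frame to contrast the two; note that a label slice includes both endpoints.

```python
import pandas as pd

# Illustrative frame; the three rows are copied from the outputs above.
sample = pd.DataFrame({
    "competitorname": ["100 Grand", "3 Musketeers", "Air Heads"],
    "chocolate": ["Yes", "Yes", "No"],
    "fruity": ["No", "No", "Yes"],
    "caramel": ["Yes", "No", "No"],
})

# A list of labels selects exactly those columns, in the order given.
picked = sample.loc[:, ["chocolate", "caramel"]]

# A label slice selects a contiguous range and includes BOTH endpoints.
ranged = sample.loc[:, "chocolate":"caramel"]

print(picked.columns.tolist())
print(ranged.columns.tolist())
```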


### Display the total count of the column. 

- Counting the number of non-null values in a column is crucial for identifying missing or incomplete data. We will calculate and display the total count for a selected column, which indicates how complete that column is.

In [25]:
# Display the total number of Competitors 
data.competitorname.count()

83
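
`count()` also works on a whole DataFrame, returning the non-null count for every column at once. In this sketch the `sample` frame and its single missing value are assumptions made for illustration; comparing the counts with the number of rows shows how many values are missing.

```python
import pandas as pd

# Illustrative frame; the NaN is inserted on purpose and is not in the dataset.
sample = pd.DataFrame({
    "competitorname": ["100 Grand", "3 Musketeers", "Air Heads"],
    "sugarpercent": [0.732, float("nan"), 0.906],
})

non_null = sample.count()          # non-null values per column
missing = len(sample) - non_null   # rows minus non-null values per column

print(non_null)
print(missing)
```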

### Display by slicing the dataset using iloc and loc commands:

- Slicing the dataset enables us to extract specific portions or subsets of the data based on row and column indices. We will demonstrate how to use the iloc and loc commands to perform slicing operations on the dataset, enabling flexible data extraction for further analysis.

In [27]:
# Display by slicing the dataset using iloc and loc commands.
data.loc[:, 'competitorname']

0                      100 Grand
1                   3 Musketeers
2                      Air Heads
3                     Almond Joy
4                      Baby Ruth
                 ...            
78                     Twizzlers
79                      Warheads
80          Welch's Fruit Snacks
81    Werther's Original Caramel
82                      Whoppers
Name: competitorname, Length: 83, dtype: object
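
The cell above uses only `.loc`. As a minimal sketch of the difference between the two commands, the snippet below slices a hand-built `sample` frame both ways: `.iloc` works on integer positions and excludes the end of a slice, while `.loc` works on labels and includes it.

```python
import pandas as pd

# Illustrative frame; the five rows are copied from the head() output above.
sample = pd.DataFrame({
    "competitorname": ["100 Grand", "3 Musketeers", "Air Heads",
                       "Almond Joy", "Baby Ruth"],
    "sugarpercent": [0.732, 0.604, 0.906, 0.465, 0.604],
    "winpercent": [66.971725, 67.602936, 52.341465, 50.347546, 56.914547],
})

# Positions 0, 1, 2 and the first two columns; the end position is excluded.
by_position = sample.iloc[0:3, 0:2]

# Index labels 0 through 3; the end label is included.
by_label = sample.loc[0:3, ["competitorname", "winpercent"]]

print(len(by_position), len(by_label))
```

With the default integer index the two slices look alike, so the differing row counts are an easy way to remember which command is end-inclusive.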

In [13]:
# Check the dataset for any null value and fill the null value with 0.01
data.fillna(value=0.01, inplace=True)
data

Unnamed: 0,id,competitorname,chocolate,fruity,caramel,peanutyalmondy,nougat,crispedricewafer,hard,bar,pluribus,sugarpercent,pricepercent,winpercent
0,0,100 Grand,Yes,No,Yes,No,No,Yes,No,Yes,No,0.732,0.860,66.971725
1,1,3 Musketeers,Yes,No,No,No,Yes,No,No,Yes,No,0.604,0.511,67.602936
2,2,Air Heads,No,Yes,No,No,No,No,No,No,No,0.906,0.511,52.341465
3,3,Almond Joy,Yes,No,No,Yes,No,No,No,Yes,No,0.465,0.767,50.347546
4,4,Baby Ruth,Yes,No,Yes,Yes,Yes,No,No,Yes,No,0.604,0.767,56.914547
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
78,78,Twizzlers,No,Yes,No,No,No,No,No,No,No,0.220,0.116,45.466282
79,79,Warheads,No,Yes,No,No,No,No,Yes,No,No,0.093,0.116,39.011898
80,80,Welch's Fruit Snacks,No,Yes,No,No,No,No,No,No,Yes,0.313,0.313,44.375519
81,81,Werther's Original Caramel,No,No,Yes,No,No,No,Yes,No,No,0.186,0.267,41.904308
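
Calling `fillna(0.01)` on the whole DataFrame would also write 0.01 into any empty text cell. A more targeted variant, sketched here on a hand-built `sample` frame whose missing values are inserted purely for illustration, first counts the nulls and then fills only the named numeric columns.

```python
import pandas as pd

# Illustrative frame; both NaN values are inserted on purpose.
sample = pd.DataFrame({
    "competitorname": ["100 Grand", "3 Musketeers", "Air Heads"],
    "sugarpercent": [0.732, float("nan"), 0.906],
    "pricepercent": [0.860, 0.511, float("nan")],
})

nulls_before = sample.isna().sum()  # null values per column before filling

# A dict limits the fill to the named columns; a new DataFrame is returned.
filled = sample.fillna({"sugarpercent": 0.01, "pricepercent": 0.01})

print(nulls_before)
print(filled)
```

Because `inplace=True` is not used here, `sample` keeps its missing values and `filled` holds the cleaned copy, which makes it easy to compare the two.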


### Display the mean for a particular column:

- Calculating the mean value for a particular column can help us understand its central tendency and make informed decisions. We will compute and display the mean for a selected column in the dataset.

In [30]:
# Find the mean winpercent
data.winpercent.mean()

50.58490762650603
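
One detail worth knowing is that `mean()` skips missing values by default. The sketch below uses a hand-built `wins` Series with one NaN inserted for illustration to show both behaviours.

```python
import pandas as pd

# Two values copied from the outputs above plus one NaN added on purpose.
wins = pd.Series([66.971725, 67.602936, float("nan")], name="winpercent")

mean_default = wins.mean()             # NaN is skipped by default
mean_strict = wins.mean(skipna=False)  # NaN propagates to the result

print(mean_default)
print(mean_strict)
```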

In [31]:
# Display how many competitors are both chocolate and fruity
dataf = data['competitorname']
dataf.loc[(data.chocolate.isin(['Yes'])) & data.fruity.isin(["Yes"])]

72    Tootsie Pop
Name: competitorname, dtype: object

In [32]:
# Display how many competitors are both hard and bar.
dataf = data['competitorname']
dataf.loc[(data.hard == 'Yes') & (data.bar == 'Yes')]

Series([], Name: competitorname, dtype: object)
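
The two filter cells above list the matching names. To get an actual count, the boolean mask can be summed, since `True` counts as 1. This is a minimal sketch on a hand-built `sample` frame whose four rows reuse values from the outputs in this notebook.

```python
import pandas as pd

# Illustrative frame; values are copied from outputs shown in this notebook.
sample = pd.DataFrame({
    "competitorname": ["100 Grand", "Air Heads", "Tootsie Pop", "Warheads"],
    "chocolate": ["Yes", "No", "Yes", "No"],
    "fruity": ["No", "Yes", "Yes", "Yes"],
})

# Each comparison needs its own parentheses because & binds tighter than ==.
mask = (sample["chocolate"] == "Yes") & (sample["fruity"] == "Yes")

names = sample.loc[mask, "competitorname"]  # which competitors match
how_many = int(mask.sum())                  # how many competitors match

print(names.tolist(), how_many)
```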

In [33]:
# Display which competitor has the highest win percent. 
data.loc[(data['winpercent'].idxmax())]

id                                         50
competitorname      Reese's Peanut Butter cup
chocolate                                 Yes
fruity                                     No
caramel                                    No
peanutyalmondy                            Yes
nougat                                     No
crispedricewafer                           No
hard                                       No
bar                                        No
pluribus                                   No
sugarpercent                             0.72
pricepercent                            0.651
winpercent                            84.1803
Name: 50, dtype: object
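
`idxmax()` returns the index label of the largest value, which is why it is passed to `.loc` in the cell above. The sketch below repeats that step on a hand-built `sample` frame and, as an optional extra, uses `nlargest` to fetch the top rows directly.

```python
import pandas as pd

# Illustrative frame; the four rows and their index labels are copied
# from the sorted output shown in this notebook.
sample = pd.DataFrame(
    {
        "competitorname": ["Reese's Peanut Butter cup", "Reese's Miniatures",
                           "Twix", "Kit Kat"],
        "winpercent": [84.180290, 81.866257, 81.642914, 76.768600],
    },
    index=[50, 49, 77, 26],
)

best_label = sample["winpercent"].idxmax()  # index label, not a position
best_row = sample.loc[best_label]           # full row for that label
top_two = sample.nlargest(2, "winpercent")  # two rows with the largest values

print(best_label)
print(top_two)
```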

### Sort a given column:

- Sorting a column allows us to arrange the data in ascending or descending order based on its values. We will demonstrate how to sort a given column and discuss its significance in data analysis.

In [15]:
# Sort the Competitors by winpercent
data.sort_values(['winpercent'], ascending=False)

Unnamed: 0,id,competitorname,chocolate,fruity,caramel,peanutyalmondy,nougat,crispedricewafer,hard,bar,pluribus,sugarpercent,pricepercent,winpercent
50,50,Reese's Peanut Butter cup,Yes,No,No,Yes,No,No,No,No,No,0.720,0.651,84.180290
49,49,Reese's Miniatures,Yes,No,No,Yes,No,No,No,No,No,0.034,0.279,81.866257
77,77,Twix,Yes,No,Yes,No,No,Yes,No,Yes,No,0.546,0.906,81.642914
26,26,Kit Kat,Yes,No,No,No,No,Yes,No,Yes,No,0.313,0.511,76.768600
62,62,Snickers,Yes,No,Yes,Yes,Yes,No,No,Yes,No,0.546,0.651,76.673782
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
24,24,Jawbusters,No,Yes,No,No,No,No,Yes,No,Yes,0.093,0.511,28.127439
70,70,Super Bubble,No,Yes,No,No,No,No,No,No,No,0.162,0.116,27.303865
10,10,Chiclets,No,Yes,No,No,No,No,No,No,Yes,0.046,0.325,24.524988
5,5,Boston Baked Beans,No,No,No,Yes,No,No,No,No,Yes,0.313,0.511,23.417824
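
`sort_values` returns a new DataFrame and keeps the original index labels, as the output above shows. The following sketch, on a hand-built `sample` frame, sorts by two columns with different directions and then renumbers the rows with `reset_index`.

```python
import pandas as pd

# Illustrative frame; values are copied from the sorted output above.
sample = pd.DataFrame({
    "competitorname": ["Kit Kat", "Twix", "Snickers", "Chiclets"],
    "sugarpercent": [0.313, 0.546, 0.546, 0.046],
    "winpercent": [76.768600, 81.642914, 76.673782, 24.524988],
})

# Sort by sugarpercent ascending and break ties by winpercent descending.
ordered = sample.sort_values(["sugarpercent", "winpercent"],
                             ascending=[True, False])

# Replace the old index labels with a fresh 0..n-1 ranking.
ranked = ordered.reset_index(drop=True)

print(ranked)
```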


---

The sample code and its outputs are shown above. Working through these tasks gives you a solid foundation in basic preprocessing techniques. These techniques serve as the building blocks for more advanced data preprocessing steps, such as handling missing values, encoding categorical variables, scaling features, and more. 

Throughout the notebook, we will use the powerful pandas library, which provides efficient data structures and functions for data manipulation and analysis. You will learn how to leverage pandas' functionalities to import datasets, explore data characteristics, perform data slicing, calculate statistics, and sort columns. 

Remember, data preprocessing is a crucial step in the data analysis workflow. By effectively cleaning and preparing your data, you can ensure that your subsequent analysis or machine learning models yield accurate and reliable results.

## Contact

If you have any questions, suggestions, or feedback regarding this IPython Notebook or any other topic related to data preprocessing, please feel free to reach out.

You can contact me at:

- Name: [Ruban Gino Singh](https://rubangino.in/)

- Email: [info@rubangino.in](mailto:info@rubangino.in)

- GitHub: [Ruban2205](https://github.com/Ruban2205/)

- LinkedIn: [ruban-gino-singh](https://www.linkedin.com/in/ruban-gino-singh/)

I am always eager to connect with fellow data enthusiasts and assist in any way possible. Don't hesitate to get in touch if you need further clarification or assistance with the preprocessing techniques covered in this notebook.

Happy preprocessing and data analysis!