## INDEX IN PYTHON

What is an index in Python: 

- In Python, indexing refers to accessing a specific element in a sequence, such as a string or a list, by its position (its index number).
- Indexing in Python **starts at 0**: the **first** element in a sequence has an **index of 0**, the **second** element has an **index of 1**, and so on.
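
A minimal sketch of zero-based indexing on a list and a string. The variable names and values here are illustrative only and are not part of the wine data used later.

```python
# Zero-based indexing on a list and on a string
colours = ['red', 'blue', 'green']
word = "wine"

first_colour = colours[0]    # first element has index 0
second_colour = colours[1]   # second element has index 1
last_colour = colours[-1]    # negative indexes count from the end
first_letter = word[0]       # strings are sequences too

print(first_colour, second_colour, last_colour, first_letter)
```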

In [4]:
# First, check our working directory
import os

wd = os.getcwd()
wd

'/home/pablo/Documents/Pablo_zorin/VS_Python_projects/07_Index_set_reset_Python'

### 1. Download data from the UC Irvine Machine Learning Repository to practise index manipulations

Get the Wine Quality data set from the UC Irvine Machine Learning Repository website

https://archive.ics.uci.edu/dataset/186/wine+quality


From the above website we download the zipped file "wine+quality.zip". I have renamed it to "wine_quality.zip" to ease any further data manipulation

#### 1.1 Unzip file to extract wine quality files

https://www.geeksforgeeks.org/unzipping-files-in-python/

We will use the zipfile module to extract the files included in a zip file

- First we import the zipfile module
- Then we create a zip file object using the ZipFile class
- Then we call the extractall() method on the zip file object and pass the path where the files need to be extracted
- This extracts all the files present in the zip file
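
The steps above can be tried without downloading anything. The sketch below is only an illustration: it builds a small archive in a temporary folder (the file names `example.zip`, `a.csv` and `b.csv` are made up), lists its members with namelist() and then extracts them with extractall().

```python
import os
import tempfile
from zipfile import ZipFile

# Build a tiny archive in a temporary folder so the sketch is self-contained
work_dir = tempfile.mkdtemp()
zip_path = os.path.join(work_dir, "example.zip")
with ZipFile(zip_path, "w") as z_out:
    z_out.writestr("a.csv", "x;y\n1;2\n")
    z_out.writestr("b.csv", "x;y\n3;4\n")

# Inspect the archive, then extract everything to a target folder
target_dir = os.path.join(work_dir, "extracted")
with ZipFile(zip_path, "r") as z_in:
    members = z_in.namelist()
    z_in.extractall(path=target_dir)

extracted = sorted(os.listdir(target_dir))
print(members)
print(extracted)
```

Listing the members first is a cheap way to confirm what an archive contains before writing files to disk.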

In [5]:
from zipfile import ZipFile

From the path we obtained at the start of this script we can locate our \data folder

In [12]:
# Open the .zip file and create a ZipFile object
with ZipFile(r'/home/pablo/Documents/Pablo_zorin/VS_Python_projects/data/wine_quality.zip', 'r') as zObject:
    # Extract all contents from the zipped file to a specific location
    # using the extractall() method of the zObject we have just created
    zObject.extractall(
        path=r'/home/pablo/Documents/Pablo_zorin/VS_Python_projects/data'
    )

Now we can see how two new .csv files about wine quality have been extracted to the /data folder

In [14]:
os.listdir(r'/home/pablo/Documents/Pablo_zorin/VS_Python_projects/data')

['wine_quality.zip',
 'winequality.names',
 'housing.csv',
 'Month_Value_1.csv',
 'winequality-red.csv',
 'winequality-white.csv',
 'Sample - Superstore.xls',
 'monthly-milk-production-pounds.csv']

We can specifically search for .csv files. To do so, we use the glob module

In [15]:
import glob
# Path to search files inside a specific sub-folder
# We specify the file extension (*.csv) we want to search inside our project folder
path = r'/home/pablo/Documents/Pablo_zorin/VS_Python_projects/data/*.csv'
files = glob.glob(path)
print(files)

['/home/pablo/Documents/Pablo_zorin/VS_Python_projects/data/housing.csv', '/home/pablo/Documents/Pablo_zorin/VS_Python_projects/data/Month_Value_1.csv', '/home/pablo/Documents/Pablo_zorin/VS_Python_projects/data/winequality-red.csv', '/home/pablo/Documents/Pablo_zorin/VS_Python_projects/data/winequality-white.csv', '/home/pablo/Documents/Pablo_zorin/VS_Python_projects/data/monthly-milk-production-pounds.csv']


### 2. Import White wine data set into Python 

Now we use Pandas to import white wine .csv file into Python

In [10]:
import pandas as pd

In [11]:
white_wine = pd.read_csv("/home/pablo/Documents/Pablo_zorin/VS_Python_projects/data/winequality-white.csv",sep =";")


Check overall structure of the white wine imported file

In [12]:
white_wine.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.001,3.0,0.45,8.8,6
1,6.3,0.3,0.34,1.6,0.049,14.0,132.0,0.994,3.3,0.49,9.5,6
2,8.1,0.28,0.4,6.9,0.05,30.0,97.0,0.9951,3.26,0.44,10.1,6
3,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6
4,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6


#### 2.1 Get white wine dataset details 

Get dataset overview

In [None]:
print(white_wine)   # 1.Get dataset overview
white_wine.info()   # 2. Get dataset details
print(white_wine.columns) # 3. Get dataset column names

We will run the above three descriptive functions

1. Get dataset overview

In [20]:
print(white_wine)

      fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  \
0               7.0              0.27         0.36            20.7      0.045   
1               6.3              0.30         0.34             1.6      0.049   
2               8.1              0.28         0.40             6.9      0.050   
3               7.2              0.23         0.32             8.5      0.058   
4               7.2              0.23         0.32             8.5      0.058   
...             ...               ...          ...             ...        ...   
4893            6.2              0.21         0.29             1.6      0.039   
4894            6.6              0.32         0.36             8.0      0.047   
4895            6.5              0.24         0.19             1.2      0.041   
4896            5.5              0.29         0.30             1.1      0.022   
4897            6.0              0.21         0.38             0.8      0.020   

      free sulfur dioxide  

2. Get dataset details

In [21]:
white_wine.info() 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4898 entries, 0 to 4897
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         4898 non-null   float64
 1   volatile acidity      4898 non-null   float64
 2   citric acid           4898 non-null   float64
 3   residual sugar        4898 non-null   float64
 4   chlorides             4898 non-null   float64
 5   free sulfur dioxide   4898 non-null   float64
 6   total sulfur dioxide  4898 non-null   float64
 7   density               4898 non-null   float64
 8   pH                    4898 non-null   float64
 9   sulphates             4898 non-null   float64
 10  alcohol               4898 non-null   float64
 11  quality               4898 non-null   int64  
dtypes: float64(11), int64(1)
memory usage: 459.3 KB


3. Get dataset column names

In [15]:
print(white_wine.columns)

Index(['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality'],
      dtype='object')


#### 2.2 Create new variables 

To create a new variable (column) in a pandas data frame, we first write the name of the data frame, then the name of the new variable in quotes inside square brackets, and then we use the = sign to assign a value to the new variable

For example, to create a new variable called 'is_red' that takes value 0 in the white_wine data frame we will do it this way: 

In [16]:
white_wine['is_red']=0.0

We can see that the new variable 'is_red' has been created in our original data frame by using the .info() method

In [17]:
white_wine.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4898 entries, 0 to 4897
Data columns (total 13 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         4898 non-null   float64
 1   volatile acidity      4898 non-null   float64
 2   citric acid           4898 non-null   float64
 3   residual sugar        4898 non-null   float64
 4   chlorides             4898 non-null   float64
 5   free sulfur dioxide   4898 non-null   float64
 6   total sulfur dioxide  4898 non-null   float64
 7   density               4898 non-null   float64
 8   pH                    4898 non-null   float64
 9   sulphates             4898 non-null   float64
 10  alcohol               4898 non-null   float64
 11  quality               4898 non-null   int64  
 12  is_red                4898 non-null   float64
dtypes: float64(12), int64(1)
memory usage: 497.6 KB


For example, to create a new variable called 'is_white' that takes value 1 in the white_wine data frame we will do it this way: 

In [18]:
white_wine['is_white']=1.0

We can see how the new variable 'is_white' has been created

In [19]:
white_wine.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,is_red,is_white
0,7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.001,3.0,0.45,8.8,6,0.0,1.0
1,6.3,0.3,0.34,1.6,0.049,14.0,132.0,0.994,3.3,0.49,9.5,6,0.0,1.0
2,8.1,0.28,0.4,6.9,0.05,30.0,97.0,0.9951,3.26,0.44,10.1,6,0.0,1.0
3,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6,0.0,1.0
4,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6,0.0,1.0


In [29]:
white_wine_checks = white_wine.copy()
white_wine_checks.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,is_red,is_white
0,7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.001,3.0,0.45,8.8,6,0.0,1.0
1,6.3,0.3,0.34,1.6,0.049,14.0,132.0,0.994,3.3,0.49,9.5,6,0.0,1.0
2,8.1,0.28,0.4,6.9,0.05,30.0,97.0,0.9951,3.26,0.44,10.1,6,0.0,1.0
3,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6,0.0,1.0
4,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6,0.0,1.0


In [31]:
type(white_wine_checks)

pandas.core.frame.DataFrame

#### 2.3 Subset variables from DataFrame

2.3.1 To select a single column

Each *column* in a *DataFrame* is a *Series*. When a single column is selected, the returned object is a pandas **Series**

In [32]:
One_col = white_wine_checks["quality"]
One_col

0       6
1       6
2       6
3       6
4       6
       ..
4893    6
4894    5
4895    6
4896    7
4897    6
Name: quality, Length: 4898, dtype: int64

We can check that this is a *Series* object

In [34]:
type(white_wine_checks["quality"])

pandas.core.series.Series

In [36]:
white_wine_checks["quality"].shape

(4898,)

The DataFrame.shape attribute returns the number of rows and columns: (nrows, ncolumns).


A *pandas* **Series** is a **1-dimensional** object, so only the number of **ROWS** is returned
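
A small sketch of the difference in shape between a Series and a DataFrame. The three-row data frame below is made up for illustration; selecting with a list of one column name (double brackets) is used only to obtain a one-column DataFrame for comparison.

```python
import pandas as pd

# Hypothetical three-row data frame, not the wine data
df = pd.DataFrame({"quality": [6, 5, 7], "is_red": [0.0, 0.0, 0.0]})

series_shape = df["quality"].shape     # single brackets return a Series
frame_shape = df[["quality"]].shape    # a list of names returns a DataFrame

print(series_shape)
print(frame_shape)
```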

2.3.2 To select multiple columns

To select **multiple** columns, you use a **list of column names** **within** the selection brackets []

In [43]:
# This is a list

My_list = ['red','blue','green']
My_list


['red', 'blue', 'green']

In [41]:
type(My_list)

list

We apply the same principle here. To subset columns from a data frame, we place the list of columns we want inside the selection brackets []

In other words, you first write the list of column names, which has its own square brackets '[]', and then you **enclose** that list in the selection brackets, giving '[[]]'. Think of it as a selection with **TWO** sets of square brackets

In [37]:
white_subset = white_wine_checks[['quality','is_red','is_white']]

In [38]:
white_subset.head()

Unnamed: 0,quality,is_red,is_white
0,6,0.0,1.0
1,6,0.0,1.0
2,6,0.0,1.0
3,6,0.0,1.0
4,6,0.0,1.0


In [44]:
type(white_subset)

pandas.core.frame.DataFrame

The above data frame is made of just **three** columns from the original data frame

### 3. Import Red wine data set into Python 

#### And create the same two new variables "is_red" and "is_white"

Now we use Pandas to import red wine .csv file into Python

In [2]:
import pandas as pd

In [3]:
red_wine = pd.read_csv("/home/pablo/Documents/Pablo_zorin/VS_Python_projects/data/winequality-red.csv",sep =";")

In [4]:
red_wine.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5


In [5]:
red_wine.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         1599 non-null   float64
 1   volatile acidity      1599 non-null   float64
 2   citric acid           1599 non-null   float64
 3   residual sugar        1599 non-null   float64
 4   chlorides             1599 non-null   float64
 5   free sulfur dioxide   1599 non-null   float64
 6   total sulfur dioxide  1599 non-null   float64
 7   density               1599 non-null   float64
 8   pH                    1599 non-null   float64
 9   sulphates             1599 non-null   float64
 10  alcohol               1599 non-null   float64
 11  quality               1599 non-null   int64  
dtypes: float64(11), int64(1)
memory usage: 150.0 KB


Create two new variables in this red_wine data set: 'is_red' equal to 0.0 and 'is_white' equal to 1.0

In [6]:
red_wine['is_red'] = 0.0
red_wine['is_white'] = 1.0

Now we check the two new variables in our red_wine data frame

In [7]:
red_wine.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 14 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         1599 non-null   float64
 1   volatile acidity      1599 non-null   float64
 2   citric acid           1599 non-null   float64
 3   residual sugar        1599 non-null   float64
 4   chlorides             1599 non-null   float64
 5   free sulfur dioxide   1599 non-null   float64
 6   total sulfur dioxide  1599 non-null   float64
 7   density               1599 non-null   float64
 8   pH                    1599 non-null   float64
 9   sulphates             1599 non-null   float64
 10  alcohol               1599 non-null   float64
 11  quality               1599 non-null   int64  
 12  is_red                1599 non-null   float64
 13  is_white              1599 non-null   float64
dtypes: float64(13), int64(1)
memory usage: 175.0 KB


### 4. Concatenate both red_wine and white_wine dataframes

We use the function pd.concat() to concatenate two data frames

- It takes two main arguments:
    - **pd.concat([dataframeA, dataframeB], axis)**
    - First the data frames we want to concatenate, as a list [dataframeA, dataframeB]
    - Then the axis, which sets how the data frames are concatenated

- In Pandas: 
    - axis = 0 refers to the index or *ROWS*: the data frames are stacked one on top of the other
    - axis = 1 refers to the *COLUMNS*: the data frames are placed side by side

We use pd.concat() with **axis = 0** as we want to stack our data in ROWS, one on top of the other. To union data frames in Python we use **axis = 0** 

- Template to concatenate ROWS: data_df = pd.concat([data_frameA, data_frameB], axis = 0)
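
To make the effect of the axis argument concrete, here is a minimal sketch with two made-up data frames (`df_a` and `df_b` are illustrative names, not the wine data). It only compares the resulting shapes.

```python
import pandas as pd

# Two small hypothetical data frames with the same columns
df_a = pd.DataFrame({"x": [1, 2], "y": [3, 4]})
df_b = pd.DataFrame({"x": [5, 6], "y": [7, 8]})

stacked = pd.concat([df_a, df_b], axis=0)        # rows of df_b go under df_a
side_by_side = pd.concat([df_a, df_b], axis=1)   # columns of df_b go beside df_a

print(stacked.shape)
print(side_by_side.shape)
```

With axis = 0 the number of rows grows; with axis = 1 the number of columns grows.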

First we check that we have the same number of VARIABLES in both data frames **white_wine** and **red_wine** before we concatenate them

The .info() method prints a summary of the DataFrame, including the number of columns and the data type of each column

In [20]:
white_wine.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4898 entries, 0 to 4897
Data columns (total 14 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         4898 non-null   float64
 1   volatile acidity      4898 non-null   float64
 2   citric acid           4898 non-null   float64
 3   residual sugar        4898 non-null   float64
 4   chlorides             4898 non-null   float64
 5   free sulfur dioxide   4898 non-null   float64
 6   total sulfur dioxide  4898 non-null   float64
 7   density               4898 non-null   float64
 8   pH                    4898 non-null   float64
 9   sulphates             4898 non-null   float64
 10  alcohol               4898 non-null   float64
 11  quality               4898 non-null   int64  
 12  is_red                4898 non-null   float64
 13  is_white              4898 non-null   float64
dtypes: float64(13), int64(1)
memory usage: 535.8 KB


In [22]:
len(white_wine)

4898

We do the same for the red_wine data frame

In [21]:
red_wine.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 14 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         1599 non-null   float64
 1   volatile acidity      1599 non-null   float64
 2   citric acid           1599 non-null   float64
 3   residual sugar        1599 non-null   float64
 4   chlorides             1599 non-null   float64
 5   free sulfur dioxide   1599 non-null   float64
 6   total sulfur dioxide  1599 non-null   float64
 7   density               1599 non-null   float64
 8   pH                    1599 non-null   float64
 9   sulphates             1599 non-null   float64
 10  alcohol               1599 non-null   float64
 11  quality               1599 non-null   int64  
 12  is_red                1599 non-null   float64
 13  is_white              1599 non-null   float64
dtypes: float64(13), int64(1)
memory usage: 175.0 KB


In [23]:
len(red_wine)

1599

#### 4.1 Check length of each data frame before we union them

Before we union them, we run a simple check to find out the total number of rows the unioned DataFrame should have

In [24]:
len(white_wine)

4898

In [25]:
len(red_wine)

1599

In [27]:
Total = len(white_wine) + len(red_wine)
Total

6497

Both data frames have the same number of columns or variables, so we can union them. When using the .info() method we saw that each of them has 14 columns 
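
The column check can also be done in code rather than by reading the .info() output. This is a sketch under the assumption that both data frames should have identical column names in the same order; `frame_a` and `frame_b` are made-up stand-ins for white_wine and red_wine.

```python
import pandas as pd

# Hypothetical stand-ins for the two wine data frames
frame_a = pd.DataFrame({"quality": [6, 5], "is_red": [0.0, 0.0]})
frame_b = pd.DataFrame({"quality": [7], "is_red": [0.0]})

# Same column names, in the same order?
same_columns = list(frame_a.columns) == list(frame_b.columns)

# Number of rows we expect after stacking the two frames
expected_rows = len(frame_a) + len(frame_b)

combined = pd.concat([frame_a, frame_b], axis=0)
print(same_columns, expected_rows, len(combined))
```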

In [29]:
all_wine = pd.concat([white_wine,red_wine],axis =0)

In [30]:
all_wine.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,is_red,is_white
0,7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.001,3.0,0.45,8.8,6,0.0,1.0
1,6.3,0.3,0.34,1.6,0.049,14.0,132.0,0.994,3.3,0.49,9.5,6,0.0,1.0
2,8.1,0.28,0.4,6.9,0.05,30.0,97.0,0.9951,3.26,0.44,10.1,6,0.0,1.0
3,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6,0.0,1.0
4,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6,0.0,1.0


In [31]:
all_wine.tail()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,is_red,is_white
1594,6.2,0.6,0.08,2.0,0.09,32.0,44.0,0.9949,3.45,0.58,10.5,5,0.0,1.0
1595,5.9,0.55,0.1,2.2,0.062,39.0,51.0,0.99512,3.52,0.76,11.2,6,0.0,1.0
1596,6.3,0.51,0.13,2.3,0.076,29.0,40.0,0.99574,3.42,0.75,11.0,6,0.0,1.0
1597,5.9,0.645,0.12,2.0,0.075,32.0,44.0,0.99547,3.57,0.71,10.2,5,0.0,1.0
1598,6.0,0.31,0.47,3.6,0.067,18.0,42.0,0.99549,3.39,0.66,11.0,6,0.0,1.0


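We can see that the last index label, 1598, **DOES NOT MATCH** the expected 6496 (the 6497 rows of white_wine and red_wine combined, minus one because the index starts at 0). pd.concat() has kept the original index of each data frame, so the index labels 0 to 1598 appear twice. This is a case where we have to **reset the index**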
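
The same situation can be reproduced on a small scale. In this sketch `top` and `bottom` are made-up data frames; it shows that the index labels restart after a row-wise concat and that a single label can then select more than one row.

```python
import pandas as pd

# Two small hypothetical data frames, each with its own 0-based index
top = pd.DataFrame({"quality": [6, 5, 7]})
bottom = pd.DataFrame({"quality": [5, 6]})
combined = pd.concat([top, bottom], axis=0)

index_labels = list(combined.index)    # labels restart for the second frame
is_unique = combined.index.is_unique   # False when labels are repeated
rows_with_label_0 = combined.loc[0]    # one row from each original frame

print(index_labels)
print(is_unique)
print(len(rows_with_label_0))
```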

### 5. Reset index in our merged Dataframe

This website below provides a good guide on when you have to reset the Index:

https://saturncloud.io/blog/how-to-reset-index-in-a-pandas-dataframe/#:~:text=If%20we%20want%20to%20change,help%20to%20avoid%20such%20conflicts

There are a few reasons why we might want to reset the index of a pandas dataframe:

1. **Missing or duplicate index values**: Sometimes, the index values might be missing or duplicated. In such cases, resetting the index can help to reassign new index values to the dataframe.

2. **Change the order of rows**: When we sort the rows by some other criterion, each row keeps its original index label, so the index is no longer in sequence. Resetting the index renumbers the rows in their new order.

3. **Merge or join dataframes**: When we merge or join two or more dataframes, we might end up with duplicate index values. Resetting the index can help to avoid such conflicts.
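
As an illustration of the second reason, here is a minimal sketch with a made-up column of values: after sorting, the index labels are out of sequence, and reset_index(drop = True) renumbers them.

```python
import pandas as pd

# Hypothetical data, not taken from the wine files
df = pd.DataFrame({"alcohol": [9.5, 8.8, 10.1]})

by_alcohol = df.sort_values("alcohol")
labels_after_sort = list(by_alcohol.index)    # rows keep their old labels

renumbered = by_alcohol.reset_index(drop=True)
labels_after_reset = list(renumbered.index)   # labels run from 0 again

print(labels_after_sort)
print(labels_after_reset)
```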

As we have UNIONED two data frames, we recommend resetting the index of the merged dataframe

all_wine = pd.concat([white_wine,red_wine],axis =0)

This was the union we performed, and we saw that we have to reset its index, as the last row should read 6496

In [32]:
all_wine.tail()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,is_red,is_white
1594,6.2,0.6,0.08,2.0,0.09,32.0,44.0,0.9949,3.45,0.58,10.5,5,0.0,1.0
1595,5.9,0.55,0.1,2.2,0.062,39.0,51.0,0.99512,3.52,0.76,11.2,6,0.0,1.0
1596,6.3,0.51,0.13,2.3,0.076,29.0,40.0,0.99574,3.42,0.75,11.0,6,0.0,1.0
1597,5.9,0.645,0.12,2.0,0.075,32.0,44.0,0.99547,3.57,0.71,10.2,5,0.0,1.0
1598,6.0,0.31,0.47,3.6,0.067,18.0,42.0,0.99549,3.39,0.66,11.0,6,0.0,1.0


We use the .reset_index() method to reset the index of a Pandas dataframe. It is recommended to include drop = True when using this method, so that the old index is discarded instead of being added to the data frame as a new column

In [33]:
all_wine_reset = all_wine.reset_index(drop = True)
all_wine_reset.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,is_red,is_white
0,7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.001,3.0,0.45,8.8,6,0.0,1.0
1,6.3,0.3,0.34,1.6,0.049,14.0,132.0,0.994,3.3,0.49,9.5,6,0.0,1.0
2,8.1,0.28,0.4,6.9,0.05,30.0,97.0,0.9951,3.26,0.44,10.1,6,0.0,1.0
3,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6,0.0,1.0
4,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6,0.0,1.0


In [34]:
all_wine_reset.tail()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,is_red,is_white
6492,6.2,0.6,0.08,2.0,0.09,32.0,44.0,0.9949,3.45,0.58,10.5,5,0.0,1.0
6493,5.9,0.55,0.1,2.2,0.062,39.0,51.0,0.99512,3.52,0.76,11.2,6,0.0,1.0
6494,6.3,0.51,0.13,2.3,0.076,29.0,40.0,0.99574,3.42,0.75,11.0,6,0.0,1.0
6495,5.9,0.645,0.12,2.0,0.075,32.0,44.0,0.99547,3.57,0.71,10.2,5,0.0,1.0
6496,6.0,0.31,0.47,3.6,0.067,18.0,42.0,0.99549,3.39,0.66,11.0,6,0.0,1.0


Now we have our dataframe with the right index after we have performed a union of both red and white sets
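
For comparison, the sketch below shows two related behaviours on made-up data frames (`top` and `bottom` are illustrative names). Passing ignore_index = True to pd.concat() gives a fresh 0-based index in one step, and calling reset_index() without drop = True keeps the old labels in a new column called 'index'.

```python
import pandas as pd

# Two small hypothetical data frames
top = pd.DataFrame({"quality": [6, 5, 7]})
bottom = pd.DataFrame({"quality": [5, 6]})

# Alternative: let concat build a new index directly
combined = pd.concat([top, bottom], axis=0, ignore_index=True)
combined_labels = list(combined.index)

# Without drop=True the old labels are kept as an extra column
kept = pd.concat([top, bottom], axis=0).reset_index()
kept_columns = list(kept.columns)

print(combined_labels)
print(kept_columns)
```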

The final step will be to export this new dataframe as a .csv file named "all_wine_reset.csv"

In [38]:
all_wine_reset.to_csv(r"/home/pablo/Documents/Pablo_zorin/VS_Python_projects/data/all_wine_reset.csv")

Remember to write the full path when writing out files in Python
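
A sketch of one way to build the full output path and write the file. It uses a temporary folder and a made-up data frame so it can run anywhere; index = False is an optional choice, not what the cell above does, and it stops pandas from writing the index as an extra unnamed column.

```python
import os
import tempfile
import pandas as pd

# Hypothetical data frame standing in for all_wine_reset
df = pd.DataFrame({"quality": [6, 5], "is_red": [0.0, 0.0]})

# Build the full path from a folder and a file name
out_dir = tempfile.mkdtemp()
out_path = os.path.join(out_dir, "all_wine_reset.csv")

# index=False leaves the row index out of the file
df.to_csv(out_path, index=False)

reloaded = pd.read_csv(out_path)
reloaded_columns = list(reloaded.columns)
print(reloaded_columns)
```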