# Getting Started
In this lab, we start by practicing how to load a dataset and work with it as a DataFrame.

In [None]:
import pandas as pd
import numpy as np

The Python code above imports two libraries, `pandas` and `numpy`, under their conventional aliases `pd` and `np`.

`pandas` is a library that provides high-performance, easy-to-use data structures and data analysis tools for the Python programming language. It is widely used for data cleaning, data manipulation and data analysis tasks, and is particularly useful for working with tabular data in the form of dataframes.

`numpy` is a library for the Python programming language, adding support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays. It is widely used for mathematical and scientific computations in Python.

# Mounting the Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


This code snippet is using the `google.colab` library, which is a module provided by Google Colaboratory. It is used to interact with the Colaboratory virtual machine (VM) and with the files stored in Google Drive.

`drive.mount('/content/drive')` mounts your Google Drive on the virtual machine. It prompts you to authorize access, for example by entering an authorization code obtained from the authorization URL. Once access is granted, the drive is mounted at the directory `/content/drive`, and the files stored in your Google Drive can be reached through paths under `/content/drive/MyDrive`.

**Reading a csv file**


In [None]:
df=pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Data/Iris.csv')
df.head()

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,1,5.1,3.5,1.4,0.2,Iris-setosa
1,2,4.9,3.0,1.4,0.2,Iris-setosa
2,3,4.7,3.2,1.3,0.2,Iris-setosa
3,4,4.6,3.1,1.5,0.2,Iris-setosa
4,5,5.0,3.6,1.4,0.2,Iris-setosa


This code snippet is using the `pandas` library to read a CSV file called '`Iris.csv`' which is located in the directory '`/content/drive/MyDrive/Colab Notebooks/Data/`'.

`pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Data/Iris.csv')` reads the CSV file using the `read_csv()` function from the `pandas` library. This function returns a DataFrame object, which is a two-dimensional size-mutable, tabular data structure with rows and columns. This DataFrame object is stored in the variable `df`.

`df.head()` is used to display the first 5 rows of the DataFrame. This function is useful to quickly check the data and make sure that it is loaded correctly.
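The same loading step can be tried without access to Google Drive. The following is a minimal sketch that assumes a tiny, hand-typed stand-in for `Iris.csv` (the `csv_text` string and the `sample` variable are illustrative names, not part of the lab); it relies on the fact that `read_csv()` accepts a file-like object as well as a path.

```python
import io

import pandas as pd

# Hypothetical stand-in for Iris.csv: three rows in the same column layout.
csv_text = """Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
1,5.1,3.5,1.4,0.2,Iris-setosa
2,4.9,3.0,1.4,0.2,Iris-setosa
3,4.7,3.2,1.3,0.2,Iris-setosa
"""

# read_csv accepts a path or any file-like object, so StringIO works here.
sample = pd.read_csv(io.StringIO(csv_text))

# head(n) returns the first n rows; without an argument it returns up to 5.
print(sample.head(2))

# dtypes shows the type pandas inferred for each column.
print(sample.dtypes)
```

Checking `dtypes` right after loading is a quick way to spot a numeric column that was read as text.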

In [None]:
df.tail(3)

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
147,148,6.5,3.0,5.2,2.0,Iris-virginica
148,149,6.2,3.4,5.4,2.3,Iris-virginica
149,150,5.9,3.0,5.1,1.8,Iris-virginica


`df.tail(3)` displays the last 3 rows of the DataFrame. It works like `df.head()`, but returns the last n rows instead of the first n. It is useful for checking the end of a DataFrame, especially when working with large datasets.

**Dataframe Properties**

In [None]:
print(df.shape)
print(df.ndim)
print(len(df))

(150, 6)
2
150


`df.shape` returns a tuple representing the dimensions of the DataFrame. The tuple contains the number of rows and columns, in that order.

`df.ndim` returns the number of dimensions of the DataFrame. Since a DataFrame is two-dimensional, it will return 2.

`len(df)` returns the number of rows in the DataFrame. It's the first element of shape tuple.

All of this information can be used to understand the size and structure of the data stored in the DataFrame, which can be useful for debugging and data analysis.
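As a small illustration of how these properties relate to each other, the sketch below uses a hypothetical three-row DataFrame named `toy` (not part of the lab data).

```python
import pandas as pd

# Hypothetical toy data: 3 rows, 2 columns.
toy = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

# shape is a (rows, columns) tuple, so it can be unpacked directly.
n_rows, n_cols = toy.shape
print(n_rows, n_cols)            # 3 2

# len() counts rows, which matches the first element of shape.
print(len(toy) == toy.shape[0])  # True

# size is the total number of cells, rows * columns.
print(toy.size)                  # 6
```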

**Access & Slicing of a Dataframe (`loc` and `iloc`)**

In [None]:
#df.loc[:, ['SepalLengthCm', 'SepalWidthCm']]
df.loc[0:5, ['SepalLengthCm', 'SepalWidthCm']]

Unnamed: 0,SepalLengthCm,SepalWidthCm
0,5.1,3.5
1,4.9,3.0
2,4.7,3.2
3,4.6,3.1
4,5.0,3.6
5,5.4,3.9


The above code snippet is using the `loc` property of the DataFrame, which is used to access a group of rows and columns by labels or a boolean array.

`df.loc[:, ['SepalLengthCm', 'SepalWidthCm']]` is used to select all rows for the columns '`SepalLengthCm`' and '`SepalWidthCm`'. It returns a new DataFrame containing only these columns.

`df.loc[0:5, ['SepalLengthCm', 'SepalWidthCm']]` is used to select rows from 0 to 5 and the columns '`SepalLengthCm`' and '`SepalWidthCm`'. It returns a new DataFrame containing only these rows and columns.

Both of the above code snippets are useful for selecting specific columns or rows from the DataFrame for further analysis or processing.

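It is worth mentioning that `loc` slices by label and includes the end label, so `df.loc[0:5, ...]` selects six rows: 0, 1, 2, 3, 4 and 5. This differs from ordinary Python slicing and from `iloc`, where the end position is excluded.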

In [None]:
df.iloc[:, [1,2]]

Unnamed: 0,SepalLengthCm,SepalWidthCm
0,5.1,3.5
1,4.9,3.0
2,4.7,3.2
3,4.6,3.1
4,5.0,3.6
...,...,...
145,6.7,3.0
146,6.3,2.5
147,6.5,3.0
148,6.2,3.4


`df.iloc[:, [1,2]]` is similar to `df.loc[:, ['SepalLengthCm', 'SepalWidthCm']]` but it uses integer-based indexing instead of label-based indexing.

`df.iloc[]` is used to select rows and columns by integer position. In this case, `df.iloc[:, [1,2]]` selects all rows (`:`) and the columns at positions 1 and 2, that is, the second and third columns, because positions start at 0. It returns a new DataFrame containing only these columns.

Both `df.loc[]` and `df.iloc[]` can be used to select specific rows and columns from a DataFrame, but `df.loc[]` uses labels or boolean arrays to select rows and columns while `df.iloc[]` uses integer-based locations.
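The difference between label-based and position-based slicing is easiest to see side by side. The sketch below uses two hypothetical DataFrames, `toy` and `lettered`, which are not part of the lab data; `lettered` has string labels so that labels and positions cannot be confused.

```python
import pandas as pd

# Hypothetical toy data with the default integer index 0..3.
toy = pd.DataFrame({"x": [10, 20, 30, 40], "y": [1.5, 2.5, 3.5, 4.5]})

by_label = toy.loc[0:2, ["x"]]     # labels 0, 1, 2: the end label is included
by_position = toy.iloc[0:2, [0]]   # positions 0, 1: the end position is excluded
print(len(by_label), len(by_position))  # 3 2

# With a non-numeric index, loc needs labels and iloc still needs positions.
lettered = toy.copy()
lettered.index = ["a", "b", "c", "d"]

print(lettered.loc["b":"c"])   # rows labeled b and c
print(lettered.iloc[1:3])      # the same two rows, selected by position
```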

In [None]:
df_data=df[df['Species']=='Iris-virginica']
df_data

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
100,101,6.3,3.3,6.0,2.5,Iris-virginica
101,102,5.8,2.7,5.1,1.9,Iris-virginica
102,103,7.1,3.0,5.9,2.1,Iris-virginica
103,104,6.3,2.9,5.6,1.8,Iris-virginica
104,105,6.5,3.0,5.8,2.2,Iris-virginica
105,106,7.6,3.0,6.6,2.1,Iris-virginica
106,107,4.9,2.5,4.5,1.7,Iris-virginica
107,108,7.3,2.9,6.3,1.8,Iris-virginica
108,109,6.7,2.5,5.8,1.8,Iris-virginica
109,110,7.2,3.6,6.1,2.5,Iris-virginica


`df[df['Species']=='Iris-virginica']` is used to filter the rows of the DataFrame `df` based on a condition. In this case, it returns a new DataFrame containing only the rows where the column '`Species`' is equal to '`Iris-virginica`'.

This is known as boolean indexing, where a boolean expression is passed as an index to select the rows from a DataFrame that satisfy the given condition.

The filtered DataFrame is assigned to the variable `df_data`. The last line evaluates `df_data`, so the notebook displays the filtered DataFrame as the cell output.

This can be useful when you want to analyze or work with a specific subset of your data.
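To make the mechanism explicit, the following sketch builds the boolean mask as a separate step and then combines two conditions. The `flowers` DataFrame is a hypothetical four-row sample using the same column names as the Iris data.

```python
import pandas as pd

# Hypothetical four-row sample with Iris-style column names.
flowers = pd.DataFrame({
    "SepalLengthCm": [5.1, 6.3, 5.8, 7.1],
    "Species": ["Iris-setosa", "Iris-virginica",
                "Iris-virginica", "Iris-virginica"],
})

# The comparison produces a boolean Series, one value per row.
mask = flowers["Species"] == "Iris-virginica"
print(mask.tolist())  # [False, True, True, True]

# Indexing with the mask keeps only the rows where it is True.
virginica = flowers[mask]

# Conditions are combined with & (and) or | (or); each one needs parentheses.
long_virginica = flowers[mask & (flowers["SepalLengthCm"] > 6.0)]
print(long_virginica)
```

Note that the filtered result keeps the original index labels, which is why `df_data` above starts at index 100.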

In [None]:
df.iloc[:,1]

0      5.1
1      4.9
2      4.7
3      4.6
4      5.0
      ... 
145    6.7
146    6.3
147    6.5
148    6.2
149    5.9
Name: SepalLengthCm, Length: 150, dtype: float64

The colon (`:`) in `df.iloc[:,1]` indicates that all rows should be selected, and the 1 after the comma selects the column at position 1, which is the second column. The result is a pandas Series containing all the values from the second column of the DataFrame.

In [None]:
df.iloc[1,:]

Id                         2
SepalLengthCm            4.9
SepalWidthCm             3.0
PetalLengthCm            1.4
PetalWidthCm             0.2
Species          Iris-setosa
Name: 1, dtype: object

The colon (`:`) in `df.iloc[1,:]` indicates that all columns should be selected, and the 1 before the comma selects the row at position 1, which is the second row. The result is a pandas Series containing all the values from the second row of the DataFrame, indexed by the column names. Because the row mixes numbers and text, its dtype is `object`.

In [None]:
print(type(df['PetalLengthCm']))

<class 'pandas.core.series.Series'>


This command shows the Python type of the object returned when a single column is selected. `df['PetalLengthCm']` accesses the column '`PetalLengthCm`' of the DataFrame `df` with the square-bracket notation, and the result is passed to the built-in `type()` function, which returns the class of its argument. Here the class is a pandas Series. This is different from the data type of the values stored in the column, which is available as `df['PetalLengthCm'].dtype`.
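Whether a selection returns a Series or a DataFrame depends on whether a single position or a list of positions is passed. The sketch below illustrates this on a hypothetical two-row DataFrame named `toy`.

```python
import pandas as pd

# Hypothetical toy data with mixed column types.
toy = pd.DataFrame({"a": [1, 2], "b": [3.0, 4.0], "c": ["x", "y"]})

col_series = toy.iloc[:, 1]    # single column position -> Series
col_frame = toy.iloc[:, [1]]   # list of positions      -> DataFrame
row_series = toy.iloc[1, :]    # single row position    -> Series

print(type(col_series).__name__)  # Series
print(type(col_frame).__name__)   # DataFrame
print(type(row_series).__name__)  # Series

# A column holds one dtype; a row spanning mixed columns falls back to object.
print(col_series.dtype)  # float64
print(row_series.dtype)  # object
```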

In [None]:
df.loc[:, ['SepalLengthCm', 'SepalWidthCm']].values


array([[5.1, 3.5],
       [4.9, 3. ],
       [4.7, 3.2],
       [4.6, 3.1],
       [5. , 3.6],
       [5.4, 3.9],
       [4.6, 3.4],
       [5. , 3.4],
       [4.4, 2.9],
       [4.9, 3.1],
       [5.4, 3.7],
       [4.8, 3.4],
       [4.8, 3. ],
       [4.3, 3. ],
       [5.8, 4. ],
       [5.7, 4.4],
       [5.4, 3.9],
       [5.1, 3.5],
       [5.7, 3.8],
       [5.1, 3.8],
       [5.4, 3.4],
       [5.1, 3.7],
       [4.6, 3.6],
       [5.1, 3.3],
       [4.8, 3.4],
       [5. , 3. ],
       [5. , 3.4],
       [5.2, 3.5],
       [5.2, 3.4],
       [4.7, 3.2],
       [4.8, 3.1],
       [5.4, 3.4],
       [5.2, 4.1],
       [5.5, 4.2],
       [4.9, 3.1],
       [5. , 3.2],
       [5.5, 3.5],
       [4.9, 3.1],
       [4.4, 3. ],
       [5.1, 3.4],
       [5. , 3.5],
       [4.5, 2.3],
       [4.4, 3.2],
       [5. , 3.5],
       [5.1, 3.8],
       [4.8, 3. ],
       [5.1, 3.8],
       [4.6, 3.2],
       [5.3, 3.7],
       [5. , 3.3],
       [7. , 3.2],
       [6.4, 3.2],
       [6.9,

This command selects all rows for specific columns by label in a pandas DataFrame. The `.loc` attribute selects rows and columns by label rather than by integer position. The colon (`:`) in `df.loc[:, ['SepalLengthCm', 'SepalWidthCm']]` indicates that all rows should be selected. The inner square brackets form a list of column names, so all rows of the columns '`SepalLengthCm`' and '`SepalWidthCm`' are selected. The `.values` attribute then returns the selected data as a numpy array with one row per DataFrame row. The pandas documentation recommends the equivalent `to_numpy()` method for new code.
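A short sketch of the conversion, using a hypothetical two-row DataFrame named `toy`, shows that `.values` and `to_numpy()` give the same array and that the index and column labels are not carried over.

```python
import numpy as np
import pandas as pd

# Hypothetical two-row sample with Iris-style column names.
toy = pd.DataFrame({
    "SepalLengthCm": [5.1, 4.9],
    "SepalWidthCm": [3.5, 3.0],
    "Species": ["Iris-setosa", "Iris-setosa"],
})

numeric = toy.loc[:, ["SepalLengthCm", "SepalWidthCm"]]

arr = numeric.to_numpy()     # method form
arr_values = numeric.values  # attribute form used in the lab

print(arr.shape)   # (2, 2)
print(arr.dtype)   # float64
print(np.array_equal(arr, arr_values))  # True
```

Selecting only the numeric columns first keeps the array dtype numeric; including the `Species` column would produce an `object` array.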

**Creating a dataframe using a list/table ndarray or a dictionary**

In [None]:
list_eg=[[1,2],[3,5]]
df1=pd.DataFrame(list_eg, columns=['A1','A2'])
df1

Unnamed: 0,A1,A2
0,1,2
1,3,5


This code is creating a new DataFrame called '`df1`' using a list of lists called '`list_eg`'. The list of lists is used as the data for the DataFrame and consists of two lists, each with two values. The `pd.DataFrame()` function is used to create a new DataFrame and takes two arguments: the data to be used and the columns names. The columns parameter is set to `['A1','A2']` so that the columns in the DataFrame will be named '`A1`' and '`A2`'.

In [None]:
tab0=np.arange(4)
print(tab0)
tab=np.arange(4).reshape((2,2))
print(tab)
df2=pd.DataFrame(tab, columns=['A1','A2'])
df2

[0 1 2 3]
[[0 1]
 [2 3]]


Unnamed: 0,A1,A2
0,0,1
1,2,3


This code creates a new DataFrame called '`df2`' from a numpy array called '`tab`'.

`np.arange(4)` creates a one-dimensional numpy array with 4 elements, starting from 0 and increasing by 1 for each element: `[0, 1, 2, 3]`. This array is stored in `tab0` and printed.

`reshape((2,2))` then reshapes this 1-D array into a 2x2 matrix. The resulting array `tab` has shape (2,2), with `[0,1]` as its first row and `[2,3]` as its second row.

The `pd.DataFrame()` function is then used to create a new DataFrame from the numpy array. The `tab` is passed as the data for the DataFrame, and the columns parameter is set to `['A1','A2']` so that the columns in the DataFrame will be named '`A1`' and '`A2`'.

In [None]:
dictionary={'A1':[30,40], 'A2':[20,20]}
df3=pd.DataFrame(dictionary, columns=["A1","A2"])
df3

Unnamed: 0,A1,A2
0,30,20
1,40,20


`dictionary={'A1':[30,40], 'A2':[20,20]}`: This line creates a dictionary called `dictionary` with two keys, '`A1`' and '`A2`', each mapped to a list of values. The key '`A1`' has the values `[30,40]` and the key '`A2`' has the values `[20,20]`.

`df3=pd.DataFrame(dictionary, columns=["A1","A2"])`: This line creates a new DataFrame called '`df3`' using `dictionary` as the data. Each key becomes a column and its list becomes that column's values. The `columns` parameter, set to `["A1","A2"]`, selects which keys to use and in which order. Here it lists both keys in their original order, so it does not change the result.

`df3`: This line is used to display the DataFrame '`df3`' in the output.
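The effect of the `columns` parameter with dictionary input is easier to see when it differs from the dictionary's own keys. The sketch below reuses the same data under the illustrative name `data`; the column name `A3` is hypothetical and deliberately absent from the dictionary.

```python
import pandas as pd

data = {"A1": [30, 40], "A2": [20, 20]}

# Without columns=, the dictionary keys are used in insertion order.
default_order = pd.DataFrame(data)
print(list(default_order.columns))  # ['A1', 'A2']

# columns= can reorder the keys...
reordered = pd.DataFrame(data, columns=["A2", "A1"])
print(list(reordered.columns))      # ['A2', 'A1']

# ...or keep only some of them.
subset = pd.DataFrame(data, columns=["A2"])
print(subset.shape)                 # (2, 1)

# A name that is not a key (hypothetical 'A3') yields a column of missing values.
with_missing = pd.DataFrame(data, columns=["A1", "A3"])
print(with_missing)
```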

### Other Utilities

In [None]:
list(df)

['Id',
 'SepalLengthCm',
 'SepalWidthCm',
 'PetalLengthCm',
 'PetalWidthCm',
 'Species']

This command returns the column names of the DataFrame `df` as a list. It works because iterating over a DataFrame yields its column labels, and the built-in `list()` function collects them into a list.

You can also use `df.columns.tolist()` to get the column names as a list.

In [None]:
df4=pd.concat([df1,df2], axis=1)
df4

Unnamed: 0,A1,A2,A1,A2
0,1,2,0,1
1,3,5,2,3


This code creates a new DataFrame called '`df4`' by concatenating two DataFrames '`df1`' and '`df2`' together. The `pd.concat()` function is used for this purpose and takes two arguments: a list of DataFrames to be concatenated, and the axis along which the DataFrames should be concatenated.

In this case, the `axis` parameter is set to 1, which means that the DataFrames are concatenated along the columns. The two DataFrames '`df1`' and '`df2`' are placed side by side, with the columns of '`df1`' on the left and the columns of '`df2`' on the right, and rows are matched by their index labels. Because both DataFrames use the column names '`A1`' and '`A2`', the result contains each name twice.
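Duplicate column names can be awkward to work with later. One possible way to keep the two halves apart is the `keys` parameter of `pd.concat()`, sketched below on two hypothetical DataFrames, `left` and `right`, that mirror `df1` and `df2`. The labels `"left"` and `"right"` are illustrative choices.

```python
import pandas as pd

# Hypothetical stand-ins that mirror df1 and df2 from the lab.
left = pd.DataFrame([[1, 2], [3, 5]], columns=["A1", "A2"])
right = pd.DataFrame([[0, 1], [2, 3]], columns=["A1", "A2"])

# Plain column-wise concatenation keeps the duplicate names.
side_by_side = pd.concat([left, right], axis=1)
print(side_by_side.shape)          # (2, 4)
print(list(side_by_side.columns))  # ['A1', 'A2', 'A1', 'A2']

# keys= adds an outer column level that records where each column came from.
labelled = pd.concat([left, right], axis=1, keys=["left", "right"])
print(labelled["right"])           # only the columns that came from `right`
```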

In [None]:
df5=pd.concat([df1,df2],axis=0)
df5

Unnamed: 0,A1,A2
0,1,2
1,3,5
0,0,1
1,2,3


This code again uses `pd.concat()` on the DataFrames '`df1`' and '`df2`', this time storing the result in a new DataFrame called '`df5`'.

In this case, the `axis` parameter is set to `0`, which means that the DataFrames are concatenated along the rows. The rows of '`df1`' are stacked above the rows of '`df2`', resulting in a new DataFrame with 4 rows. Each row keeps its original index label, so the index of '`df5`' reads 0, 1, 0, 1.

In [None]:
df5.reset_index(drop=True, inplace=True)
df5

This code resets the index of the DataFrame '`df5`' using the `reset_index()` function, which replaces the current index with the default one: integers starting from 0 and increasing by 1 for each row.

The `drop=True` parameter discards the old index instead of keeping it as a new column. The `inplace=True` parameter applies the change to '`df5`' itself, so that the original DataFrame is modified and no new one is returned.
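The row-wise concatenation and the index reset can also be done in one step. The sketch below, which uses hypothetical DataFrames `top` and `bottom` mirroring `df1` and `df2`, compares `reset_index(drop=True)` with the `ignore_index=True` option of `pd.concat()`.

```python
import pandas as pd

# Hypothetical stand-ins that mirror df1 and df2 from the lab.
top = pd.DataFrame([[1, 2], [3, 5]], columns=["A1", "A2"])
bottom = pd.DataFrame([[0, 1], [2, 3]], columns=["A1", "A2"])

# Row-wise concatenation keeps the original labels, so they repeat.
stacked = pd.concat([top, bottom], axis=0)
print(stacked.index.tolist())     # [0, 1, 0, 1]

# Option 1: renumber afterwards (returns a new DataFrame without inplace=True).
renumbered = stacked.reset_index(drop=True)
print(renumbered.index.tolist())  # [0, 1, 2, 3]

# Option 2: ask concat to ignore the incoming labels from the start.
direct = pd.concat([top, bottom], axis=0, ignore_index=True)
print(direct.index.tolist())      # [0, 1, 2, 3]
```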

In [None]:
df5.rename(columns={'A1':'X'}, inplace=True)
df5

This code renames the column '`A1`' of the DataFrame '`df5`' to '`X`' using the `rename()` function. The `rename()` function is used to change the name of one or more columns in a DataFrame. It takes a dictionary as an argument, where the keys of the dictionary are the current column names and the values of the dictionary are the new column names.

In this case, the dictionary passed to `rename()` is `{'A1':'X'}`, which means the column '`A1`' will be renamed to '`X`'. The `inplace=True` parameter is used to make the changes to the DataFrame in place, so that the original DataFrame is modified and no new one is created.
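For comparison, here is a minimal sketch of `rename()` without `inplace=True`, using a hypothetical DataFrame named `frame`. The call then returns a renamed copy and leaves the original untouched. The key `"missing"` is a made-up name included to show that, with default settings, keys that match no column are ignored.

```python
import pandas as pd

# Hypothetical toy data.
frame = pd.DataFrame({"A1": [1, 3], "A2": [2, 5]})

# Without inplace=True, rename returns a new DataFrame.
# 'missing' matches no column and is silently ignored by default.
renamed = frame.rename(columns={"A1": "X", "missing": "Y"})

print(list(renamed.columns))  # ['X', 'A2']
print(list(frame.columns))    # ['A1', 'A2'] (original unchanged)
```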

In [None]:
df5.drop(index=[1,2], inplace=True)
df5

This code drops rows from the DataFrame '`df5`' using the `drop()` function. The rows to remove are given through the `index` keyword, which is equivalent to passing the labels together with `axis=0` (the default axis, meaning rows).

In this case, the `index` parameter is set to `[1, 2]`, so the rows whose index labels are 1 and 2 are removed. Since the index was reset earlier, these are the second and third rows. The `inplace=True` parameter applies the change to '`df5`' itself, so that the original DataFrame is modified and no new one is returned. The remaining rows keep their labels, 0 and 3.
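Because `drop()` works with labels rather than positions, the two can diverge once rows have been removed. The sketch below uses a hypothetical DataFrame named `frame` with the same shape as `df5`, and shows one way to drop by position by looking the label up through `.index`.

```python
import pandas as pd

# Hypothetical stand-in with the same layout as df5 after the rename.
frame = pd.DataFrame({"X": [1, 3, 0, 2], "A2": [2, 5, 1, 3]})

# Drop the rows labeled 1 and 2; the survivors keep labels 0 and 3.
kept = frame.drop(index=[1, 2])
print(kept.index.tolist())  # [0, 3]

# To drop by position instead, translate the position into a label first.
first_label = kept.index[[0]]            # label of the row at position 0
by_position = kept.drop(index=first_label)
print(by_position.index.tolist())        # [3]
```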

In [None]:
df_data.to_csv('/content/drive/MyDrive/IrisModified.csv', index=False, sep=";")

This line of code is used to save the DataFrame '`df_data`' to a csv file named '`IrisModified.csv`' using the `to_csv()` function. The first argument to the `to_csv()` function is the file path where the csv file should be saved. The `index=False` parameter is passed to `to_csv()` to prevent the DataFrame index from being included in the csv file. The `sep=";"` parameter is passed to `to_csv()` to set the delimiter of the csv file to be "`;`".

Because the path is '`/content/drive/MyDrive/IrisModified.csv`', the new file '`IrisModified.csv`' is created in the mounted Google Drive and contains the data from the DataFrame '`df_data`'.
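When a file is written with a non-default delimiter, the same delimiter has to be given when reading it back. The following sketch demonstrates the round trip in memory, using an `io.StringIO` buffer in place of a Drive path and a hypothetical two-row DataFrame named `frame`.

```python
import io

import pandas as pd

# Hypothetical two-row sample.
frame = pd.DataFrame({
    "Id": [101, 102],
    "Species": ["Iris-virginica", "Iris-virginica"],
})

# Write to an in-memory buffer instead of a file on Drive.
buffer = io.StringIO()
frame.to_csv(buffer, index=False, sep=";")
text = buffer.getvalue()
print(text)  # header line is "Id;Species", no index column

# Reading back requires the same separator.
restored = pd.read_csv(io.StringIO(text), sep=";")
print(restored)
```

Reading the same text with the default `sep=","` would put each whole line into a single column.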




