<h1 style="color:orange;">Pandas in Python</h1>
<p style="font-size:18px;">Pandas is a powerful and popular data manipulation and analysis library in Python. It provides data structures like Series and DataFrame that are well-suited for working with structured data, such as spreadsheets or SQL tables. Pandas makes it easy to clean, analyze, and visualize data, making it a fundamental tool for data scientists, analysts, and engineers. Here's an overview of how to use Pandas in Python:</p>

<h3 style="color:orange;">Dictionary to DataFrame</h3>
<p style="font-size:18px;">Pandas is an open-source library that provides high-performance, easy-to-use data structures and data analysis tools for Python. Sounds promising!

The DataFrame is one of Pandas' most important data structures. It's basically a way to store tabular data where you can label the rows and the columns. One way to build a DataFrame is from a dictionary.

In the exercises that follow you will be working with vehicle data from different countries. Each observation corresponds to a country and the columns give information about the number of vehicles per capita, whether people drive left or right, and so on. Three lists are pre-defined in the script:</p>

<li>names, containing the country names for which data is available.</li>
<li>dr, a list of booleans that tells whether people drive on the right (True) or on the left (False) in the corresponding country.</li>
<li>cpc, the number of motor vehicles per 1000 people in the corresponding country.</li>

<p>Each dictionary key is a column label and each value is a list which contains the column elements.</p>

<li>Import pandas as pd.</li>
<li>Use the pre-defined lists to create a dictionary called my_dict. There should be three key value pairs:</li>
<li>key 'country' and value names.</li>
<li>key 'drives_right' and value dr.</li>
<li>key 'cars_per_cap' and value cpc.</li>
<li>Use pd.DataFrame() to turn your dict into a DataFrame called cars.</li>
<li>Print out cars and see how beautiful it is.</li>

In [2]:
# Pre-defined lists
names = ['United States', 'Australia', 'Japan', 'India', 'Russia', 'Morocco', 'Egypt']
dr = [True, False, False, False, True, True, True]
cpc = [809, 731, 588, 18, 200, 70, 45]

# Import pandas as pd
import pandas as pd


# Create dictionary my_dict with three key:value pairs: my_dict
my_dict = {
    'country': names,
    'drives_right': dr,
    'cars_per_cap': cpc
}


# Build a DataFrame cars from my_dict: cars
cars = pd.DataFrame(my_dict)


# Print cars
print(cars)

         country  drives_right  cars_per_cap
0  United States          True           809
1      Australia         False           731
2          Japan         False           588
3          India         False            18
4         Russia          True           200
5        Morocco          True            70
6          Egypt          True            45
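
<p>As a quick check of the statement that each dictionary key becomes a column label, the following minimal sketch rebuilds the same DataFrame and inspects its labels and shape. It only uses the lists shown above; nothing beyond them is assumed.</p>

```python
import pandas as pd

names = ['United States', 'Australia', 'Japan', 'India', 'Russia', 'Morocco', 'Egypt']
dr = [True, False, False, False, True, True, True]
cpc = [809, 731, 588, 18, 200, 70, 45]

cars = pd.DataFrame({'country': names, 'drives_right': dr, 'cars_per_cap': cpc})

# Column labels come from the dictionary keys, in insertion order
print(list(cars.columns))

# (rows, columns): one row per list element, one column per key
print(cars.shape)

# Without explicit row labels, pandas numbers the rows starting at 0
print(list(cars.index))
```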


<p>The Python code that solves the previous exercise is included in the script. Have you noticed that the row labels (i.e. the labels for the different observations) were automatically set to integers from 0 up to 6?

To replace these integers with more meaningful labels, a list row_labels has been created. You can use it to specify the row labels of the cars DataFrame. You do this by setting the index attribute of cars, which you can access as cars.index.</p>

<li>Specify the row labels by setting cars.index equal to row_labels.</li>
<li>Print out cars again and check if the row labels are correct this time.</li>

In [3]:
import pandas as pd

# Build cars DataFrame
names = ['United States', 'Australia', 'Japan', 'India', 'Russia', 'Morocco', 'Egypt']
dr = [True, False, False, False, True, True, True]
cpc = [809, 731, 588, 18, 200, 70, 45]
cars_dict = {'country': names, 'drives_right': dr, 'cars_per_cap': cpc}
cars = pd.DataFrame(cars_dict)


# Definition of row_labels
row_labels = ['US', 'AUS', 'JPN', 'IN', 'RU', 'MOR', 'EG']

# Specify row labels of cars
cars.index = row_labels


# Print cars again
print(cars)

           country  drives_right  cars_per_cap
US   United States          True           809
AUS      Australia         False           731
JPN          Japan         False           588
IN           India         False            18
RU          Russia          True           200
MOR        Morocco          True            70
EG           Egypt          True            45
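
<p>Setting cars.index after the fact is one option. As an alternative sketch, the row labels can also be passed when the DataFrame is built, through the index argument of pd.DataFrame(). The name cars_labeled is only used here for illustration.</p>

```python
import pandas as pd

cars_dict = {
    'country': ['United States', 'Australia', 'Japan', 'India', 'Russia', 'Morocco', 'Egypt'],
    'drives_right': [True, False, False, False, True, True, True],
    'cars_per_cap': [809, 731, 588, 18, 200, 70, 45],
}
row_labels = ['US', 'AUS', 'JPN', 'IN', 'RU', 'MOR', 'EG']

# Pass the row labels at construction time instead of assigning cars.index later
cars_labeled = pd.DataFrame(cars_dict, index=row_labels)
print(cars_labeled)
```

<p>Both approaches should give the same table; passing index up front simply saves one assignment.</p>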


<h3 style="color:orange;">CSV to DataFrame</h3>
<p>Putting data in a dictionary and then building a DataFrame works, but it's not very efficient. What if you're dealing with millions of observations? In those cases, the data is typically available as files with a regular structure. One of those file types is the CSV file, which is short for "comma-separated values".

To import CSV data into Python as a Pandas DataFrame you can use read_csv().

Let's explore this function with the same cars data from the previous exercises. This time, however, the data is available in a CSV file, named cars.csv. It is available in your current working directory, so the path to the file is simply 'cars.csv'.</p>
<li>To import CSV files you still need the pandas package: import it as pd.</li>
<li>Use pd.read_csv() to import cars.csv data as a DataFrame. Store this DataFrame as cars.</li>
<li>Print out cars. Does everything look OK?</li>

In [4]:
# Import pandas as pd
import pandas as pd


# Import the cars.csv data: cars
cars = pd.read_csv('cars.csv')




# Print out cars
print(cars)

  Unnamed: 0        country drives_right  cars_per_cap
0         US  United States         True           809
1        AUS      Australia        False           731
2        JPN          Japan        False           588
3         IN          India        False            18
4         RU         Russia         True           200
5        MOR        Morocco         True            70
6         EG          Egypt         True            45
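
<p>The "Unnamed: 0" column most likely appears because the first field of the header row in the file is empty. The sketch below reproduces this with a small in-memory CSV; the text in csv_text is an assumption about what the file looks like, not a copy of the real cars.csv.</p>

```python
import io
import pandas as pd

# Assumed file layout: the header row starts with an empty field,
# because the row labels have no column name of their own
csv_text = """,country,drives_right,cars_per_cap
US,United States,True,809
AUS,Australia,False,731
JPN,Japan,False,588
"""

cars_raw = pd.read_csv(io.StringIO(csv_text))

# pandas fills in a placeholder name for the unnamed first column
print(list(cars_raw.columns))
```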


<p>Your read_csv() call to import the CSV data didn't generate an error, but the output is not entirely what we wanted. The row labels were imported as another column without a name.

Remember index_col, an argument of read_csv(), that you can use to specify which column in the CSV file should be used as a row label? Well, that's exactly what you need here!

Python code that solves the previous exercise is already included; can you make the appropriate changes to fix the data import?</p>
<li>Run the code and confirm that the first column should actually be used as row labels.</li>
<li>Specify the index_col argument inside pd.read_csv(): set it to 0, so that the first column is used as row labels.</li>
<li>Has the printout of cars improved now?</li>

In [5]:
# Import pandas as pd
import pandas as pd

# Fix import by including index_col
cars = pd.read_csv('cars.csv', index_col=0)

# Print out cars
print(cars)

           country drives_right  cars_per_cap
US   United States         True           809
AUS      Australia        False           731
JPN          Japan        False           588
IN           India        False            18
RU          Russia         True           200
MOR        Morocco         True            70
EG           Egypt         True            45
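
<p>To see index_col at work without needing a file on disk, here is a small round-trip sketch: a labeled DataFrame is written to an in-memory buffer with to_csv() and read back with index_col=0. The three-row table and the names buffer and restored are illustrative only.</p>

```python
import io
import pandas as pd

cars = pd.DataFrame(
    {
        'country': ['United States', 'Australia', 'Japan'],
        'drives_right': [True, False, False],
        'cars_per_cap': [809, 731, 588],
    },
    index=['US', 'AUS', 'JPN'],
)

# to_csv() writes the row labels as the first column of the output
buffer = io.StringIO()
cars.to_csv(buffer)
buffer.seek(0)

# index_col=0 turns that first column back into row labels
restored = pd.read_csv(buffer, index_col=0)
print(restored)
```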


<h3 style="color:orange;">Accessing DataFrames</h3>
<h3 style="color:orange;">Square Brackets Technique</h3>
<p style="font-size:18px;">In the sample code, the same cars data is imported from a CSV file as a Pandas DataFrame.<br> To select only the cars_per_cap column from cars, you can use:

cars['cars_per_cap'] <br>
cars[['cars_per_cap']] <br>
The single bracket version gives a Pandas Series, the double bracket version gives a Pandas DataFrame.</p>

<li>Use single square brackets to print out the country column of cars as a Pandas Series.</li>
<li>Use double square brackets to print out the country column of cars as a Pandas DataFrame.</li>
<li>Use double square brackets to print out a DataFrame with both the country and drives_right columns of cars, in this order.</li>



In [11]:
# Import cars data
import pandas as pd
cars = pd.read_csv('cars.csv', index_col=0)

# Print out country column as Pandas Series
print(cars['country'])



# Print out country column as Pandas DataFrame
print(cars[['country']])


# Print out DataFrame with country and drives_right columns
print(cars[['country','drives_right']])

US     United States
AUS        Australia
JPN            Japan
IN             India
RU            Russia
MOR          Morocco
EG             Egypt
Name: country, dtype: object
           country
US   United States
AUS      Australia
JPN          Japan
IN           India
RU          Russia
MOR        Morocco
EG           Egypt
           country drives_right
US   United States         True
AUS      Australia        False
JPN          Japan        False
IN           India        False
RU          Russia         True
MOR        Morocco         True
EG           Egypt         True
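
<p>The difference between single and double brackets is easiest to see by checking the type and shape of each result. A minimal sketch, using a three-row stand-in for the cars data:</p>

```python
import pandas as pd

cars = pd.DataFrame(
    {
        'country': ['United States', 'Australia', 'Japan'],
        'drives_right': [True, False, False],
        'cars_per_cap': [809, 731, 588],
    },
    index=['US', 'AUS', 'JPN'],
)

as_series = cars['country']     # single brackets: one-dimensional Series
as_frame = cars[['country']]    # double brackets: DataFrame with one column

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```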


<p style='font-size:18px;'><span style="color:orange;">Square brackets</span> can do more than just selecting columns. You can also use them to get rows, or observations, from a DataFrame. The following call selects the first five rows from the cars DataFrame:

cars[0:5] <br>
The result is another DataFrame containing only the rows you specified.

<span style="color:red">Pay attention:</span> You can only select rows using square brackets if you specify a slice, like 0:4. Also, you're using the integer indexes of the rows here, not the row labels!</p>

<li>Select the first 3 observations from cars and print them out.</li>
<li>Select the fourth, fifth and sixth observation, corresponding to row indexes 3, 4 and 5, and print them out.</li>

In [13]:
# Import cars data
import pandas as pd
cars = pd.read_csv('cars.csv', index_col=0)

# Print out first 3 observations
print(cars[0:3])


# Print out fourth, fifth and sixth observation
print(cars[3:6])

           country drives_right  cars_per_cap
US   United States         True           809
AUS      Australia        False           731
JPN          Japan        False           588
     country drives_right  cars_per_cap
IN     India        False            18
RU    Russia         True           200
MOR  Morocco         True            70
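
<p>The warning about slices can be checked directly. In the sketch below, a slice selects rows by position, while a bare integer is looked up as a column label and fails because no column is called 0. The three-row table is a stand-in for the cars data.</p>

```python
import pandas as pd

cars = pd.DataFrame(
    {
        'country': ['United States', 'Australia', 'Japan'],
        'drives_right': [True, False, False],
        'cars_per_cap': [809, 731, 588],
    },
    index=['US', 'AUS', 'JPN'],
)

# A slice inside square brackets selects rows by integer position
first_two = cars[0:2]
print(first_two)

# A bare integer is treated as a column label, so this raises KeyError
try:
    cars[0]
except KeyError as err:
    print('KeyError:', err)

# iloc expresses the same position-based row selection explicitly
print(cars.iloc[0:2].equals(first_two))
```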


<p style='font-size:18px;'><span style="color:orange;">loc and iloc</span> With loc and iloc you can do practically any data selection operation on DataFrames you can think of. loc is label-based, which means that you have to specify rows and columns based on their row and column labels. iloc is integer index based, so you have to specify rows and columns by their integer index like you did in the previous exercise.

Try out the following commands in the IPython Shell to experiment with loc and iloc to select observations. Each pair of commands here gives the same result.

cars.loc['RU'] <br>
cars.iloc[4]<br>

cars.loc[['RU']]<br>
cars.iloc[[4]]<br>

cars.loc[['RU', 'AUS']]<br>
cars.iloc[[4, 1]]<br>
As before, code is included that imports the cars data as a Pandas DataFrame.</p>

<li>Use loc or iloc to select the observation corresponding to Japan as a Series. The label of this row is JPN, the index is 2. Make sure to print the resulting Series.</li>
<li>Use loc or iloc to select the observations for Australia and Egypt as a DataFrame. You can find out about the labels/indexes of these rows by inspecting cars in the IPython Shell. Make sure to print the resulting DataFrame.</li>


In [14]:
# Import cars data
import pandas as pd
cars = pd.read_csv('cars.csv', index_col=0)

# Print out observation for Japan
print(cars.loc['JPN'])
print(cars.iloc[2])

# Print out observations for Australia and Egypt
print(cars.loc[['AUS','EG']])
print(cars.iloc[[1,6]])

country         Japan
drives_right    False
cars_per_cap      588
Name: JPN, dtype: object
country         Japan
drives_right    False
cars_per_cap      588
Name: JPN, dtype: object
       country drives_right  cars_per_cap
AUS  Australia        False           731
EG       Egypt         True            45
       country drives_right  cars_per_cap
AUS  Australia        False           731
EG       Egypt         True            45
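
<p>The exercises above select whole rows, but loc and iloc also accept a column selector after the comma. The following sketch rebuilds the cars table inline and picks out a single cell and a block of rows and columns. Note one difference that is easy to miss: a label slice in loc includes its end label, while a position slice in iloc excludes its end position.</p>

```python
import pandas as pd

cars = pd.DataFrame(
    {
        'country': ['United States', 'Australia', 'Japan', 'India', 'Russia', 'Morocco', 'Egypt'],
        'drives_right': [True, False, False, False, True, True, True],
        'cars_per_cap': [809, 731, 588, 18, 200, 70, 45],
    },
    index=['US', 'AUS', 'JPN', 'IN', 'RU', 'MOR', 'EG'],
)

# Single cell: row label + column label, or row position + column position
print(cars.loc['JPN', 'cars_per_cap'])
print(cars.iloc[2, 2])

# 'US':'JPN' includes JPN, whereas 0:3 stops before position 3
by_label = cars.loc['US':'JPN', ['country', 'cars_per_cap']]
by_position = cars.iloc[0:3, [0, 2]]
print(by_label.equals(by_position))
```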
