# Pandas in Python

![Pandas](https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTkKmkBqviJKR3yCj5F251eRodlrKmubG6ey7pJMGGLMs2CF23gBT_4QevLGRVUjcSXSkQ&usqp=CAU)

Pandas is a library for working with tables, similar to **Excel**. The difference is that Excel is not an ideal option for handling large amounts of data. This is precisely where Pandas comes in: it enables a wide variety of table operations on large datasets. The importance of Pandas is hard to overstate, as the library contributes significantly to Python's success among data scientists.

The documentation can be found [here](https://pandas.pydata.org/docs/index.html), and numerous learning resources and tutorials are available on the internet. The following document provides a brief introduction to Pandas and showcases a few important commands.

Tables are referred to as *dataframes*.

![Pandas](Dataframe.png)
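
To make the idea of a dataframe concrete, here is a minimal sketch that builds a small one by hand. The values and the name `photos` are hypothetical; they only loosely mirror the columns of the Flickr file used later.

```python
import pandas as pd

# Hypothetical sample data: each dictionary key becomes a column
photos = pd.DataFrame(
    {
        "p_id": [101, 102, 103],
        "lat": [51.5043, 51.5059, 51.5053],
        "lon": [-0.0787, -0.0822, -0.0897],
        "title": ["City Hall", "Figurehead", "Station sign"],
    }
)

print(photos.shape)          # (number of rows, number of columns)
print(list(photos.columns))  # the column names
print(list(photos.index))    # the row labels; by default 0, 1, 2, ...
```

Every dataframe has column names and a row index, which is what the access methods described later rely on.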

After importing the library, a CSV file can be read in and converted into a dataframe. For the sake of simplicity, we call the new dataframe `df`.

The CSV file is located in the working directory and contains metadata of sample images from *Flickr*. It serves as our example file.

## Read CSV files

[CSV reader](https://pandas.pydata.org/docs/user_guide/io.html#io-read-csv-table) documentation.

In [1]:
import pandas as pd

In [2]:
path = "Excercise 03 - Flickr.csv"

df = pd.read_csv(path, sep = ",")

df

Unnamed: 0,p_id,lat,lon,o_id,p_date,accuracy,title,tags,url
0,214028,51.504265,-0.078659,51035570500@N01,2003-05-24 17:40:38,16,"City Hall, London",london england normanfoster architecture geota...,https://live.staticflickr.com/1/214028_a96ba3e...
1,436102,51.505937,-0.082182,40385587@N00,2004-03-16 14:02:05,14,Angel. A Figurehead,london thames ship olympus c740 figurehead wys...,https://live.staticflickr.com/1/436102_7907e51...
2,436113,51.505323,-0.089650,40385587@N00,2004-03-16 14:04:28,14,Sign at London Bridge Station,signs london olympus lee c740 wysiwyg,https://live.staticflickr.com/1/436113_d87fc68...
3,459075,51.505062,-0.079017,98406434@N00,2004-06-27 00:00:00,16,London - City Hall looming,london towerbridge cityhall anyhoo photobyanyhoo,https://live.staticflickr.com/1/459075_44a6d3d...
4,459077,51.505062,-0.079017,98406434@N00,2004-06-27 00:00:00,16,London - City Hall reflection,reflection london thames cityhall anyhoo photo...,https://live.staticflickr.com/1/459077_74f6ecc...
...,...,...,...,...,...,...,...,...,...
89016,53282639969,51.505597,-0.083953,35234357@N04,2023-06-20 17:23:41,16,"Hays Galleria, London, 20th June 2023",20thjune june2023 london haysgalleria haysgall...,https://live.staticflickr.com/65535/5328263996...
89017,53282751605,51.504722,-0.081667,35234357@N04,2023-06-20 17:29:54,16,"Apart Together by Olivia Hylton, Part of the M...",20thjune june2023 london morph morphsculptures...,https://live.staticflickr.com/65535/5328275160...
89018,53283030331,51.505111,-0.089925,78364563@N00,2023-10-21 14:44:02,16,,,https://live.staticflickr.com/65535/5328303033...
89019,53283300699,51.505944,-0.088692,138177073@N04,2023-06-13 19:27:27,16,London,london england unitedkingdom,https://live.staticflickr.com/65535/5328330069...
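
The `sep` argument matters when a file uses a delimiter other than a comma. The following sketch reads a small, hypothetical semicolon-separated table from an in-memory string instead of a file, so it runs without the Flickr data.

```python
import io

import pandas as pd

# Hypothetical CSV content that uses ";" as the delimiter
csv_text = """p_id;lat;lon;title
101;51.5043;-0.0787;City Hall
102;51.5059;-0.0822;Figurehead
103;51.5053;-0.0897;Station sign
"""

# read_csv accepts file paths as well as file-like objects such as StringIO
sample = pd.read_csv(io.StringIO(csv_text), sep=";")

print(sample)
print(sample.dtypes)  # pandas infers a data type for each column
```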


## Accessing Data
You can access DataFrame data using these familiar-looking operations:

In [6]:
#HOW TO ACCESS ROWS

# the first 4 rows
# df.head(4)

# the last 4 rows
# df.tail(4)

# # the rows at positions 5000 to 5002 (the end of a slice is excluded)
# df[5000:5003]

# or just the row at integer position 8567 (the 8568th row, since counting starts at 0)
df.iloc[8567]

p_id                                               2778949712
lat                                                 51.504896
lon                                                 -0.086688
o_id                                             83736792@N00
p_date                                    2008-06-01 17:35:17
accuracy                                                   14
title                                        20080601-City-03
tags        city uk england building london architecture n...
url         https://live.staticflickr.com/3213/2778949712_...
Name: 8567, dtype: object

In [8]:
#HOW TO ACCESS COLUMNS

# extract a column as a Series (a list-like object); in this case, all titles
# df["title"]

# get the title of row 40235
df["title"][40235]

'The Shard Opening 027'
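
The access patterns above can be tried on a small, hypothetical dataframe. This is only a sketch; the column values are invented for illustration.

```python
import pandas as pd

# Hypothetical data with the default index 0..4
sample = pd.DataFrame(
    {
        "title": ["a", "b", "c", "d", "e"],
        "accuracy": [16, 14, 14, 16, 16],
    }
)

first_two = sample.head(2)  # the first two rows
last_two = sample.tail(2)   # the last two rows
middle = sample[1:3]        # rows at positions 1 and 2; the end is excluded
titles = sample["title"]    # a single column, returned as a Series

# A single value: row label first, then column name
one_title = sample.loc[3, "title"]
print(one_title)
```

Selecting a single value with `sample.loc[3, "title"]` does the row and column lookup in one step, which is an alternative to chaining two pairs of brackets.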

## .loc and .iloc 

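As you can see, the syntax is very similar to that of lists. However, a single pair of brackets always returns a collection of values: a row slice gives a dataframe and a column gives a Series. If you want to access a single row, an expression like `df[8566]` raises an error, because pandas interprets the value in the brackets as a column name. In these cases, you have to use `.loc` and `.iloc`.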

In [10]:
# df[8566]

# get the row at integer position 8567 (the 8568th row)
df.iloc[8567]

p_id                                               2778949712
lat                                                 51.504896
lon                                                 -0.086688
o_id                                             83736792@N00
p_date                                    2008-06-01 17:35:17
accuracy                                                   14
title                                        20080601-City-03
tags        city uk england building london architecture n...
url         https://live.staticflickr.com/3213/2778949712_...
Name: 8567, dtype: object

`iloc` selects rows by their integer position (0, 1, 2, etc.), regardless of what the index labels are. With the default index, which is typically the case, positions and labels coincide.

<div class="alert alert-block alert-info">
<b>Tip:</b>  

The Flickr data file does not include an index column by default. This column was added during the import process, but its inclusion is optional.  
</div>  

The following code imports the same dataframe as before, but this time we set the first column, `p_id`, as the index. Keep in mind that the *index* should be unique so that each label identifies exactly one row. An index does not have to be numeric; it can also consist of strings, such as names.

In [11]:
df1 = pd.read_csv(path, sep = ",", index_col = "p_id")

df1

p_id,lat,lon,o_id,p_date,accuracy,title,tags,url
214028,51.504265,-0.078659,51035570500@N01,2003-05-24 17:40:38,16,"City Hall, London",london england normanfoster architecture geota...,https://live.staticflickr.com/1/214028_a96ba3e...
436102,51.505937,-0.082182,40385587@N00,2004-03-16 14:02:05,14,Angel. A Figurehead,london thames ship olympus c740 figurehead wys...,https://live.staticflickr.com/1/436102_7907e51...
436113,51.505323,-0.089650,40385587@N00,2004-03-16 14:04:28,14,Sign at London Bridge Station,signs london olympus lee c740 wysiwyg,https://live.staticflickr.com/1/436113_d87fc68...
459075,51.505062,-0.079017,98406434@N00,2004-06-27 00:00:00,16,London - City Hall looming,london towerbridge cityhall anyhoo photobyanyhoo,https://live.staticflickr.com/1/459075_44a6d3d...
459077,51.505062,-0.079017,98406434@N00,2004-06-27 00:00:00,16,London - City Hall reflection,reflection london thames cityhall anyhoo photo...,https://live.staticflickr.com/1/459077_74f6ecc...
...,...,...,...,...,...,...,...,...
53282639969,51.505597,-0.083953,35234357@N04,2023-06-20 17:23:41,16,"Hays Galleria, London, 20th June 2023",20thjune june2023 london haysgalleria haysgall...,https://live.staticflickr.com/65535/5328263996...
53282751605,51.504722,-0.081667,35234357@N04,2023-06-20 17:29:54,16,"Apart Together by Olivia Hylton, Part of the M...",20thjune june2023 london morph morphsculptures...,https://live.staticflickr.com/65535/5328275160...
53283030331,51.505111,-0.089925,78364563@N00,2023-10-21 14:44:02,16,,,https://live.staticflickr.com/65535/5328303033...
53283300699,51.505944,-0.088692,138177073@N04,2023-06-13 19:27:27,16,London,london england unitedkingdom,https://live.staticflickr.com/65535/5328330069...


In [12]:
print(df1.loc[459077])

lat                                                 51.505062
lon                                                 -0.079017
o_id                                             98406434@N00
p_date                                    2004-06-27 00:00:00
accuracy                                                   16
title                           London - City Hall reflection
tags        reflection london thames cityhall anyhoo photo...
url         https://live.staticflickr.com/1/459077_74f6ecc...
Name: 459077, dtype: object


<div class="alert alert-block alert-info">
<b>Summary:</b>  

- `.iloc` selects a row by its integer position. This always works.
- `.loc` selects a row by its label. It returns exactly one row only if the label is unique.
</div>  
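
The difference between the two accessors shows most clearly when the index is not the default 0, 1, 2. The sketch below uses three `p_id` values and titles from the table above in a small, hand-built dataframe; the name `sample` is hypothetical.

```python
import pandas as pd

sample = pd.DataFrame(
    {
        "p_id": [214028, 436102, 436113],
        "title": [
            "City Hall, London",
            "Angel. A Figurehead",
            "Sign at London Bridge Station",
        ],
    }
).set_index("p_id")

by_position = sample.iloc[1]   # second row, counted from 0
by_label = sample.loc[436102]  # row whose index label is 436102

print(by_position["title"])
print(by_label["title"])

# .loc looks for the label 1, which does not exist in this index
try:
    sample.loc[1]
    label_missing = False
except KeyError:
    label_missing = True
print(label_missing)
```

Both lookups return the same row here, but only `.iloc[1]` keeps working if the labels change.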

In [13]:
# search for a specific entry like the Photo ID with the number "11869901163"

df.loc[df['p_id'] == 11869901163]


Unnamed: 0,p_id,lat,lon,o_id,p_date,accuracy,title,tags,url
49614,11869901163,51.505931,-0.078116,35234357@N04,2013-10-05 16:27:16,16,Barge on the Thames - Opposite the Tower of Lo...,london october2013 shipsandboats toweroflondon...,https://live.staticflickr.com/5486/11869901163...
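
The expression `df['p_id'] == 11869901163` produces a Series of `True`/`False` values, and `.loc` keeps the rows where it is `True`. A minimal sketch with hypothetical data shows this mask explicitly and how two conditions can be combined.

```python
import pandas as pd

# Hypothetical data for illustration
sample = pd.DataFrame(
    {
        "p_id": [1, 2, 3, 4],
        "accuracy": [16, 14, 16, 14],
        "lat": [51.5043, 51.5059, 51.5053, 51.5051],
    }
)

# The comparison yields one boolean per row
mask = sample["accuracy"] == 16
print(mask.tolist())

high_accuracy = sample.loc[mask]

# Combine conditions with & (and) or | (or); each condition needs parentheses
combined = sample.loc[(sample["accuracy"] == 16) & (sample["lat"] > 51.505)]
print(combined)
```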


## Descriptive statistics

[Descriptive statistics](https://pandas.pydata.org/docs/user_guide/basics.html#descriptive-statistics)


In [None]:
# print(df["lat"].sum())

print(df["lat"].describe())
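
For a numeric column, `describe()` returns a Series of summary values that can be read individually. The following sketch uses four hypothetical latitude values.

```python
import pandas as pd

# Hypothetical latitude values
sample = pd.DataFrame({"lat": [51.50, 51.51, 51.52, 51.53]})

stats = sample["lat"].describe()
print(stats)

# Individual statistics are accessed by their label
print(stats["mean"])
print(stats["min"], stats["max"])

# Single statistics are also available as methods
print(sample["lat"].median())
```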



## Data Types and Conversion

In [None]:
# What are the datatypes in the table?
df.dtypes

In [None]:
# Convert one column from int to string
df = df.astype({'p_id': 'string'})
df.dtypes
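
The effect of `astype` can be checked on a small example. This sketch uses hypothetical data and shows that the conversion returns a new dataframe, which is why the result is assigned back to `df` above.

```python
import pandas as pd

# Hypothetical data: both columns start out as integers
sample = pd.DataFrame({"p_id": [214028, 436102], "accuracy": [16, 14]})

# The dictionary maps column names to target types; other columns stay unchanged
converted = sample.astype({"p_id": "string"})

print(converted.dtypes)
print(converted["p_id"].tolist())

# String methods are now available on the converted column
print(converted["p_id"].str.len().tolist())
```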

The trickiest part, which occurs frequently, involves managing dates and times. In its original form, a string like `2013-10-05 16:27:16` isn't automatically recognized as a date. Therefore, we have to convert it into a `datetime` object.

The command for this is `pd.to_datetime`; its documentation is [here](https://pandas.pydata.org/docs/user_guide/timeseries.html).

As the example below shows, the data is not always clean. The simplest remedy is to delete an offending row with `.drop()`.

In [22]:
# Convert 'p_date' to 'datetime' in order to perform date operations

# pd.to_datetime(df['p_date'])

# df.iloc[53556]

# df = df.drop(53556)

df['p_date'] = pd.to_datetime(df['p_date'])

df.dtypes

p_id        string[python]
lat                float64
lon                float64
o_id                object
p_date      datetime64[ns]
accuracy             int64
title               object
tags                object
url                 object
dtype: object
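
When it is not known in advance which rows contain invalid dates, one possible alternative to dropping a row by its index is to let `pd.to_datetime` mark unparseable values. This is a sketch with hypothetical data, not the approach used in the cell above: with `errors="coerce"`, invalid entries become `NaT` (not a time) and can then be removed with `dropna`.

```python
import pandas as pd

# Hypothetical data with one invalid entry
raw = pd.DataFrame(
    {"p_date": ["2013-10-05 16:27:16", "not a date", "2004-03-16 14:02:05"]}
)

# Invalid strings are turned into NaT instead of raising an error
raw["p_date"] = pd.to_datetime(raw["p_date"], errors="coerce")
print(raw["p_date"].isna().sum())  # number of values that could not be parsed

# Keep only the rows with a valid date
clean = raw.dropna(subset=["p_date"])
print(clean["p_date"].dt.year.tolist())
```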

## Filter out data

Now that `p_date` is a `datetime` column, we can extract data points by date.



In [24]:
filtered_df = df.query("p_date >= '2004-03-01 00:00:00' and p_date < '2004-04-01 00:00:00'")

filtered_df = filtered_df.sort_values(by='p_date', ascending=True)

filtered_df

Unnamed: 0,p_id,lat,lon,o_id,p_date,accuracy,title,tags,url
12144,3532394992,51.505083,-0.078449,37490392@N07,2004-03-01 22:01:14,14,Famous Landmark,bridge london architecture towerbridge archite...,https://live.staticflickr.com/3391/3532394992_...
87673,52805276323,51.505902,-0.083437,95012874@N00,2004-03-02 20:12:07,16,Hay's Galleria,minibreak minoltadynax5 2004 35mm film slr lon...,https://live.staticflickr.com/65535/5280527632...
87671,52805242405,51.505902,-0.083437,95012874@N00,2004-03-02 20:12:07,16,"Hay's Galleria, The Navigators",minibreak minoltadynax5 2004 35mm film slr mar...,https://live.staticflickr.com/65535/5280524240...
87670,52805240045,51.505902,-0.083437,95012874@N00,2004-03-02 20:12:07,16,"Hay's Galleria, The Navigators",minibreak minoltadynax5 2004 35mm film slr lon...,https://live.staticflickr.com/65535/5280524004...
87669,52805235780,51.505902,-0.083437,95012874@N00,2004-03-02 20:12:07,16,Hay's Galleria,spring minoltadynax5 2004 35mm epsonperfection...,https://live.staticflickr.com/65535/5280523578...
87672,52805245090,51.505902,-0.083437,95012874@N00,2004-03-02 20:12:07,16,"Hay's Galleria, The Navigators",minibreak minoltadynax5 2004 35mm epsonperfect...,https://live.staticflickr.com/65535/5280524509...
87665,52804844331,51.505902,-0.083437,95012874@N00,2004-03-02 20:12:07,16,"Hay's Galleria, The Navigators",minibreak minoltadynax5 2004 35mm film slr mar...,https://live.staticflickr.com/65535/5280484433...
3161,538159185,51.504956,-0.079135,19243288@N00,2004-03-10 13:45:50,16,London City Hall,england london tower thames river hall europe ...,https://live.staticflickr.com/1204/538159185_5...
6742,2395984590,51.504441,-0.07802,24404258@N04,2004-03-13 15:33:08,14,London,england london,https://live.staticflickr.com/2409/2395984590_...
6734,2395154053,51.504441,-0.07802,24404258@N04,2004-03-13 16:04:34,14,London - The Tower Bridge,london,https://live.staticflickr.com/3078/2395154053_...
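
The same kind of date filter can also be written with a boolean mask instead of `query`. The sketch below uses three hypothetical rows and checks that both forms select the same data.

```python
import pandas as pd

# Hypothetical data: one date before, one inside, one after March 2004
sample = pd.DataFrame(
    {
        "p_id": [1, 2, 3],
        "p_date": pd.to_datetime(
            ["2004-02-28 10:00:00", "2004-03-16 14:02:05", "2004-04-01 00:00:00"]
        ),
    }
)

start = pd.Timestamp("2004-03-01")
end = pd.Timestamp("2004-04-01")

# Boolean mask: start is included, end is excluded
in_march = sample[(sample["p_date"] >= start) & (sample["p_date"] < end)]

# The equivalent query string
same_result = sample.query("p_date >= '2004-03-01' and p_date < '2004-04-01'")

print(in_march)
```

Using `<` for the upper bound keeps the row dated exactly `2004-04-01 00:00:00` out of the March selection.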


## Save out Data

[Documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html)

In [25]:
path = "Filtered.csv"


filtered_df.to_csv(path_or_buf=path, header=True)
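
By default, `to_csv` also writes the index as the first column. If that column is not wanted in the output, `index=False` leaves it out. The following sketch writes a small, hypothetical dataframe to an in-memory buffer instead of a file and reads it back to check the round trip.

```python
import io

import pandas as pd

# Hypothetical data for illustration
sample = pd.DataFrame({"p_id": [1, 2], "title": ["a", "b"]})

# Write to an in-memory buffer; a file path works the same way
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)

# Read the text back into a dataframe
restored = pd.read_csv(io.StringIO(csv_text))
print(restored.equals(sample))
```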