# PANDAS

Pandas is a software library written for the Python programming language for **data manipulation and analysis**. In particular, it offers data structures and operations for manipulating numerical tables and time series.

In [None]:
import pandas as pd

## DataFrames 

The data structures that store data in pandas are the **Data Frames**.

In [None]:
df1 = pd.DataFrame([[2, 4, 6],  [10, 20, 30]])
df1

The 0, 1, 2 at the top are the **column names** (you can name the columns yourself) and the 0 and 1 on the left are the **indices** (which can also be renamed)

In [None]:
df1 = pd.DataFrame([[2, 4, 6],  [10, 20, 30]], columns = ["Column1", "Column2", "Column3"], index= ["Row1", "Row2"])
df1

See, the names are there.

Customized row indices don't make much sense here, but hey! you can set them if you want to.

Another way of creating Data Frames is to use dictionaries instead of lists

In [None]:
df2 = pd.DataFrame([{"Last Name" : "Bond", "Name" : "James"}, {"Last Name" : "Stark", "Name" : "Tony"}, {"Last Name" : "Anderson", "Name" : "Thomas"},{"Last Name" : "Holmes", "Name" : "Sherlock"},{"Name" : "Neo"}])
print(type(df2))
df2

This also shows how impractical it is to create data frames manually like this. We mostly import data for analysis and processing from externally created CSVs or other files

Also, notice the data type of df2!
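
One more thing worth noticing in df2: the last dictionary has no "Last Name" key, so pandas fills that cell with a missing value (NaN). Here is a minimal sketch of that behaviour on a shortened list of records (the names `records` and `people` are chosen just for this example):

```python
import pandas as pd

# Two records; the second one lacks the "Last Name" key
records = [
    {"Last Name": "Bond", "Name": "James"},
    {"Name": "Neo"},
]
people = pd.DataFrame(records)

# Columns are the union of all keys; absent values become NaN
print(people)
print(people.isna().sum())   # number of missing values per column
```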

In [None]:
print("The column-wise mean of the data in first data frame")
print(df1.mean())
print("\nAnd the mean of entire data set is")
print(df1.mean().mean())

And this is a (very, very, very layman-level) example of data analysis

## Series

In [None]:
type(df1.mean())

Well, this is a pandas **Series** object. Series have more or less the same methods that we can apply on a data frame object.

Data Frames are made up of Series. You'll get the gist once you see the code below and its result

In [None]:
print("Data type of the columns of a Data Frame is",type(df1.Column1))
print("Thus we can find their mean or do other (I don't know what) stuff on them.\nLike the mean of Column1 is", df1.Column1.mean(),"\nAnd the largest element of Column1 is", df1.Column1.max())

## Working with files/data

### CSV files

Working with **csv's** and other file formats

In [None]:
df1 = pd.read_csv("Dataset/supermarkets.csv")
df1

Wasn't that easy and simple!

### json files

In [None]:
df2 = pd.read_json("Dataset/supermarkets.json")
df2

### Excel files

Loading **Excel** files

In [None]:
df3 = pd.read_excel("Dataset/supermarkets.xlsx", sheet_name = 0)
df3

### Plain files

From **plain txt** files (separated by commas) we use *pandas.read_csv()*

Despite its name, *read_csv* handles any *character-separated* file, not only comma-separated ones. For commas we don't need to pass any **separator** argument, but for other separators we need to pass the character that is used as the separator

In [None]:
df4 = pd.read_csv("Dataset/supermarkets-commas.txt")
df4

Below we can check what happens if we use something other than comma as a separator

In [None]:
df4 = pd.read_csv("Dataset/supermarkets-semi-colons.txt")
df4

Well, that came out badly. It didn't recognize the separator, so every row ended up in a single column.

So what do we do? We pass the separator explicitly

In [None]:
df5 = pd.read_csv("Dataset/supermarkets-semi-colons.txt", sep = ';')
df5

It worked!
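
If you don't have the files at hand, the same effect can be reproduced in memory. This is only a sketch: the rows in `raw` are made up, and `io.StringIO` stands in for the text file.

```python
import io
import pandas as pd

# In-memory stand-in for a semicolon-separated text file (made-up rows)
raw = "ID;City;Employees\n1;San Francisco;15\n2;Oakland;8\n"

wrong = pd.read_csv(io.StringIO(raw))            # default sep="," -> one wide column
right = pd.read_csv(io.StringIO(raw), sep=";")   # explicit separator -> three columns

print(wrong.shape)
print(right.shape)
```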

### Headers 

Setting table **Header Row**

When data is imported, the first row is treated as the header row by default, so if the data is missing a header row we get something like this

In [None]:
df6 = pd.read_csv("Dataset/supermarkets_noH.csv")
print(df6.sum())
df6

Well, what the heck: the first row of data got used as the column names

So we need to explicitly tell pandas that the data lacks a header row by setting the **header** parameter as ***header = None***

In [None]:
df7 = pd.read_csv("Dataset/supermarkets_noH.csv", header = None)
print(df7.sum(axis = 1))    #row-wise
print(df7.sum(axis = 0))    #column-wise
df7

So this tells pandas that we don't have a header, so it will generate a generic header made up of integer indices.

We can initialize the column and row names as we did earlier or like this

### Naming columns

In [None]:
df7.columns = ["ID", "Address", "City", "ZIP", "Country", "Name", "Employees"]
df7
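
The column names can also be supplied while reading, through the `names` parameter of *read_csv*. A small sketch on an in-memory sample (the two rows and the shortened column list are made up for illustration):

```python
import io
import pandas as pd

# Header-less sample data standing in for supermarkets_noH.csv
raw = "1,735 Dolores St,San Francisco\n2,332 Hill St,San Francisco\n"

shops = pd.read_csv(
    io.StringIO(raw),
    header=None,                      # the data has no header row
    names=["ID", "Address", "City"],  # so we provide the column names ourselves
)
print(shops)
```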

We may want to use a given attribute of the data as the row index, like *ID* in the data frame above.

We can do this using the ***dataframe.set_index("name")*** method which *returns a **new*** dataframe 

In [None]:
df7.set_index("ID")

The initial df7 is still unchanged, so we need to save the output of set_index in a variable, or set the **inplace** parameter of set_index to **True**

In [None]:
df8 = df7.set_index("ID")
df7

### **But a word of caution**

In [None]:
df8.set_index("Address", inplace= True)

Now when we change the index (here we set it to Address), the new index is set, but the old index is not turned back into a column; it is dropped

In [None]:
df8

What we can do to tackle it is set **drop** parameter to ***False*** so that the attribute we set as index will also remain as a column so that later on if we change the index we'll still have the data that was earlier used for indexing

In [None]:
df8.set_index("Name", inplace= True, drop= False)

See that Name has become an index but it will also remain as a column

In [None]:
df8
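
Another option, sketched below on a tiny made-up frame, is to call *reset_index()* before switching: it moves the current index back into the columns, so nothing is lost when a new index is set.

```python
import pandas as pd

# Made-up sample data; the column names mirror the ones used above
shops = pd.DataFrame(
    {"ID": [1, 2], "Name": ["Shop A", "Shop B"], "Employees": [15, 8]}
)

by_id = shops.set_index("ID")          # ID leaves the columns and becomes the index
restored = by_id.reset_index()         # ID is a regular column again
by_name = restored.set_index("Name")   # switching index no longer loses ID

print(by_name)
```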

## Filtering Data

Indexing and extracting data

We'll use df7 data frame to work upon

In [None]:
df7

In [None]:
data = df7.set_index("Address", drop=True)
data

We can use *label based* or *position based* indexing

**Label based indexing** is when we use the row and column labels for addressing them (unlike ordinary Python slices, *loc* slices include the end label)

In [None]:
data.loc["735 Dolores St" : "3995 23rd St", "Country" : "Name"]

And specific elements using

In [None]:
data.loc["332 Hill St", "Country"]

And for all the rows of a column

In [None]:
list(data.loc[:, "Country"])   #And the list function of python will convert it in a list

**Position based indexing** uses integer positions instead of labels, just like indexing Python lists

In [None]:
data

In [None]:
data.iloc[1:3, 1:3]   #and as usual it is upper-bound exclusive

In [None]:
data.iloc[:, 1:3]

In [None]:
data.iloc[3, 1:3]
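
To see the two styles side by side, here is a small sketch on a made-up frame (`demo` is just an example name). The same block of cells is selected once by label and once by position:

```python
import pandas as pd

demo = pd.DataFrame(
    {"A": [1, 2, 3, 4], "B": [10, 20, 30, 40], "C": [100, 200, 300, 400]},
    index=["w", "x", "y", "z"],
)

by_label = demo.loc["x":"y", "A":"B"]   # both end labels are included
by_position = demo.iloc[1:3, 0:2]       # stop positions are excluded

print(by_label.equals(by_position))
```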

We can also get rid of columns and rows of data frames (though these operations are not inplace, ie, they will create a new instance rather than modifying the object on which they were invoked)

In [None]:
data

### Deleting elements (dataframe.drop())

The **axis** argument of the ***dataframe.drop()*** function tells whether we want to delete columns (**1**) or rows (**0**). Recent pandas versions accept it only as a keyword

In [None]:
data.drop("City", axis=1)   # axis=1 removes columns

**0** removes rows

In [None]:
data.drop("332 Hill St", axis=0)

To drop columns or rows based on indexing, we do a trick

Example for rows is below

In [None]:
data.drop(data.index[0:3], axis=0)

Similarly for columns

In [None]:
data.drop(data.columns[0:3], axis=1)

***dataframe.index*** returns the row labels and ***dataframe.columns*** returns the column names; both are pandas *Index* objects, which is why we can slice them by position

In [None]:
print(data.index,"\n")
print(data.columns)
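
*drop()* also accepts the `columns=` and `index=` keywords, which some find easier to read than the axis numbers. A quick sketch on a made-up frame:

```python
import pandas as pd

demo = pd.DataFrame(
    {"City": ["a", "b", "c"], "ZIP": [1, 2, 3], "Name": ["x", "y", "z"]},
    index=["r0", "r1", "r2"],
)

no_city = demo.drop(columns="City")              # same as demo.drop("City", axis=1)
no_first_two = demo.drop(index=demo.index[:2])   # drop rows by position via their labels

print(list(no_city.columns))
print(list(no_first_two.index))
print(demo.shape)   # the original frame is untouched
```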

## Updating and adding **Series**

### Adding Columns

**!!!** When adding a new column of data to a data frame, we need to declare the column name and then assign a list of data values, one for each row (so the number of values in the list must be equal to the number of rows in the data frame)

In [None]:
data["Continent"] = ["North America"] * data.shape[0]    #Or just multiply by len(data.index)
data

### Modifying a column

In [None]:
data["Continent"] = data["Country"] + ", " + "North America"
data

### Adding a row

This is less straightforward than adding a column. One way of doing this is to **transpose** the data frame, add a column like we did previously, and then transpose it back

In [None]:
data_temp = data.T
data_temp

In [None]:
data_temp["My Address"] = [7, "My City", "My ZIP code", "My Country", "My Name", 67, "My Continent"]
data = data_temp.T
data

Aaaaand we successfully added a row to our data frame

And to update the data of a row we can do a similar thing

### Modifying rows

In [None]:
data = data.T
data["My Address"] = [7, "Varanasi", "VNS 221003", "India", "Hattori", "1", "Asia"]
data = data.T
data
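
As an alternative to the double transpose, rows can also be added or overwritten through *loc* with a row label. This is a minimal sketch on a made-up two-column frame, not a replacement for the cells above:

```python
import pandas as pd

# Made-up frame indexed by address, with fewer columns than the real data
demo = pd.DataFrame(
    {"City": ["San Francisco"], "Employees": [15]},
    index=["735 Dolores St"],
)

# Assigning to a label that does not exist yet appends a row
demo.loc["My Address"] = ["My City", 67]

# Assigning to an existing label overwrites that row
demo.loc["My Address"] = ["Varanasi", 1]

print(demo)
```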

# Example of Data Analysis (Very very basic)

Some tidbits: The process of converting addresses into latitudes and longitudes is called **Geocoding**, and the process of converting the latitude and longitude of a place into an address is called **Reverse geocoding**

In [None]:
from geopy.geocoders import ArcGIS
nom = ArcGIS()

In [None]:
n = nom.geocode("Nav Sadhana Kendra, Varanasi, 221003, Uttar Pradesh")
print(type(n))
n

Oh WoW! it works

Don't worry about the 0.0 at the end. It's just part of the geocoder's response and doesn't mean much. Sometimes the geocoder may return a None object if the location cannot be found (oooorr... it's top secret).

Also, isn't the type of the location interesting? It is a special object, the Location object of geopy. Let's extract the latitude and longitude data from it

In [None]:
print(n.latitude, n.longitude)

Now let's convert an entire dataframe of addresses into latitudes and longitudes and then add two columns of the extracted information to the data frame

In [None]:
data = pd.read_csv("Dataset/supermarkets.csv")
data

Well, the geocoder expects a string as input, usually of the form "road, city, zip code, country": basically an address string and not a dataframe. So, let's modify the address column to meet our requirements and then pass the data to the geocoder

In [None]:
data["Address"] = data["Address"] + ", " + data["City"] + ", " + data["State"] + ", " + data["Country"]
data

## The **dataframe.apply** function

You might be thinking of iterating, but with pandas you don't need to, as pandas has functions that let you apply a method to every element of a column. To do that we create a new column/series and assign it the output of data["column to apply on"].**apply(function name)**

In [None]:
data["Coordinates"] = data["Address"].apply(nom.geocode)
data

Now the coordinates column of the dataframe contains the location objects corresponding to the addresses 

Now to add two more columns to save the latitude and the longitude of the data points separately we'll use the **apply** method in conjunction with a lambda expression

In [None]:
data["Latitude"] = data["Coordinates"].apply(lambda x : x.latitude if x is not None else None)  # the conditional keeps the script from crashing when the geocoder returns None for an address
data["Longitude"] = data["Coordinates"].apply(lambda x : x.longitude if x is not None else None)
data
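
The same pattern can be tried without any network access. In this sketch `FakeLocation` and `fake_geocode` are hypothetical stand-ins for geopy's Location object and the geocoder, and the coordinates are rough placeholder values:

```python
from collections import namedtuple
import pandas as pd

# Hypothetical stand-in for geopy's Location object (no network needed)
FakeLocation = namedtuple("FakeLocation", ["latitude", "longitude"])

def fake_geocode(address):
    """Pretend geocoder: returns None for addresses it does not know."""
    known = {"Varanasi, India": FakeLocation(25.3, 83.0)}
    return known.get(address)

places = pd.DataFrame({"Address": ["Varanasi, India", "Nowhere"]})

# apply() calls the function once per element of the column
places["Coordinates"] = places["Address"].apply(fake_geocode)
places["Latitude"] = places["Coordinates"].apply(
    lambda loc: loc.latitude if loc is not None else None
)
print(places)
```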

# Numerical and Scientific Computing with Numpy

## What is Numpy?

NumPy is the fundamental package for scientific computing in Python. NumPy arrays facilitate advanced mathematical and other types of operations on large numbers of data. 

In [None]:
import numpy as np

## The numpy object: *ndarray*
At the core of the NumPy package, is the **ndarray** object. This encapsulates n-dimensional arrays of homogeneous data types, with many operations being performed in compiled code for performance. 

**Axes (the dimensions) of a 2-D array are referred to as 0 or 1**
- 0 -> the vertical axis (runs down the rows)
- 1 -> the horizontal axis (runs across the columns)
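
A small sketch of what the axis number does in practice: an operation along an axis collapses that axis. The array `a` is made up for the example.

```python
import numpy as np

a = np.arange(6).reshape(2, 3)   # 2 rows, 3 columns
# [[0 1 2]
#  [3 4 5]]

col_sums = a.sum(axis=0)   # collapse axis 0 (the rows)    -> one value per column
row_sums = a.sum(axis=1)   # collapse axis 1 (the columns) -> one value per row

print(col_sums)
print(row_sums)
```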

### ndarray of single dimension: **numpy.arange**

In [None]:
n = np.arange(27)
n

This isn't exactly a multi-dimensional array as it only has a single dimension but it is a **numpy.ndarray** 

In [None]:
type(n)

### Creating a real multidimensional array by **reshape**ing existing $1$-dim array

In [None]:
n.reshape(3, 9)

In [None]:
n.reshape(3, 3, 3)
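
The only rule for *reshape* is that the total number of elements must stay the same. A short sketch (the name `v` is used here so the notebook's `n` is left alone):

```python
import numpy as np

v = np.arange(27)

cube = v.reshape(3, 3, 3)      # 3 * 3 * 3 == 27, so this works
auto = v.reshape(3, -1)        # -1 lets NumPy infer the missing size (9)

try:
    v.reshape(4, 7)            # 28 != 27 -> ValueError
except ValueError as err:
    print("reshape failed:", err)

print(cube.shape, auto.shape)
```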

### Creating numpy arrays using existing python sequences: **numpy.asarray()**

In [None]:
lis = [[1, 2, 3, 4, 5], [11, 22, 33, 44, 55], [111, 222, 333, 444, 555]]
m = np.asarray(lis)
m

## Converting images to Numpy Arrays
We'll use the **OpenCV** package, also referred to as **cv2**, for doing this.

In [None]:
import cv2

The **cv2.imread** function expects two args for loading an image.
- The path to the image
- An *int* flag indicating whether it is to be loaded as **grayscale** $(0)$ or as **BGR** $(1)$ (in exactly that order: first blue, then green, then red).

In [None]:
img_grey = cv2.imread("Dataset/smallgray.png", 0)
img_grey

See the grayscale intensities in the 2-D array: 255 is white (most intense) and 0 is black

In [None]:
img_bgr = cv2.imread("Dataset/smallgray.png", 1)
img_bgr

This is the 3-D array of 
- Blue Layer
- Green Layer
- Red Layer

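**!!!** One thing to remember is that the colour channels sit on the **last axis**: the array has shape (rows, columns, 3). So each $(5 \times 3)$ block printed above is one image row holding the B, G, R values of its 5 pixels, and not a $(3 \times 5)$ colour layer. A whole layer is obtained by slicing the last axis, e.g. *img_bgr[:, :, 0]* for blue.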
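
Here is a small sketch of that layout on a synthetic array, so it runs without OpenCV or the image file. It assumes a 3 x 5 picture, like the one used above:

```python
import numpy as np

# Synthetic 3 x 5 "image" standing in for the array returned by cv2.imread(..., 1)
height, width = 3, 5
img = np.zeros((height, width, 3), dtype=np.uint8)
img[..., 0] = 255            # channel 0 is blue in OpenCV's BGR order

blue = img[:, :, 0]          # one full colour layer, shape (3, 5)
first_row = img[0]           # one image row, shape (5, 3): 5 pixels x (B, G, R)

print(img.shape, blue.shape, first_row.shape)
```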

### Converting Numpy arrays to images: **cv2.imwrite**

In [None]:
cv2.imwrite("new_smallgray.png", img_grey)

## Addressing elements in ndarray

### Indexing and slicing Numpy Arrays

Indexing and slicing are basically the same as we do with python lists

In [None]:
n = img_grey
n

In [None]:
n[0:2]

In [None]:
n[0:2, 2:4]

In [None]:
n[2:, 2:4]

Let's check out the shape

In [None]:
n.shape

### Iteration 
There are two ways to iterate over a ndarray
- Row-wise or column-wise
- Element-wise

**Row-wise**

In [None]:
for i in n:
    print(i)

**Column-wise**

In [None]:
for i in n.T:
    print(i)

### **flat**tening a ndarray: n-D to 1-D
We can also treat an ndarray as a single-dimensional sequence through its **flat** attribute, which can be used to iterate over the array **element-wise**

In [None]:
for i in n.flat:
    print(i)

## Stacking and Splitting Numpy Arrays

### Stacking/Concatenating
There are (obviously) two ways of stacking ndarrays
- **hstack** will stack the 2 or more arrays horizontally
- **vstack** will stack them vertically

**!!!** Though we're concatenating multiple ndarrays, these stack methods take only one argument: a **tuple** of all the ndarrays to be concatenated.

Also, the ndarrays to be concatenated must have **matching shapes** along the other axis: the same number of rows for *hstack* and the same number of columns for *vstack*.

In [None]:
n_hcat = np.hstack((n, n))
n_hcat

In [None]:
n_vcat = np.vstack((n, n, n))
n_vcat

### Splitting
This too can be done in two ways, horizontally (**hsplit**) and vertically (**vsplit**).

It takes an ndarray (the one to be split) and an integer $k$ and then divides the ndarray into $k$ equal ndarrays, horizontally or vertically (as specified). The size along that axis must be divisible by $k$, otherwise an error is raised.

In [None]:
n_hsplit = np.hsplit(n, 5)
n_hsplit

In [None]:
n_vsplit = np.vsplit(n_vcat, 3)
n_vsplit

Also, notice the type of data these split methods are returning

In [None]:
type(n_vsplit)

So **split** functions return a **python list of numpy arrays**, this implies that we can access each ndarray as we do with any python list

In [None]:
n_vsplit[2]
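
The shape rules for stacking and splitting can be checked on a small made-up array. The last line uses *np.array_split*, which (unlike *hsplit* and *vsplit*) tolerates a size that does not divide evenly:

```python
import numpy as np

a = np.arange(15).reshape(3, 5)

wide = np.hstack((a, a))       # row counts must match    -> shape (3, 10)
tall = np.vstack((a, a, a))    # column counts must match -> shape (9, 5)

parts = np.vsplit(tall, 3)                 # list of three (3, 5) arrays
uneven = np.array_split(a, 2, axis=1)      # 5 columns into 2 parts: widths 3 and 2

print(wide.shape, tall.shape)
print([p.shape for p in uneven])
```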

# Preparation for the Finals


## Folium 
is a Python library used for visualizing geospatial data.

Folium is a Python wrapper for Leaflet.js, a leading open-source JavaScript library for plotting interactive maps. Thus whatever we write in python will automatically be converted into js, html and css code (as they are what the **Web Map** runs on)

In [None]:
import folium

### Map Objects
Everything in *folium* spins around its data-structure the **Map Object**. So to start working, the first thing we need to do is to create a map object using
#### folium.Map
Map is the class that creates the map object 

In [None]:
map1 = folium.Map(location= [38.58, -99.09])   # named map1 so that python's built-in map() isn't shadowed
map1

Now we have a map object, but we haven't converted it into a usable format
## Converting map object to html

In [None]:
map1.save("Dataset/OutputMap1.html")

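We can also play around with the zoom and the webmap tiles (which tile names are accepted depends on the installed folium version; newer releases may no longer ship the Stamen names, while "OpenStreetMap" is the default)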

In [None]:
map2 = folium.Map(location= [38.58, -99.09], zoom_start= 6, tiles = "Stamen Terrain")
map2

In [None]:
map2.save("Dataset/OutputMap2.html")

## Adding point markers and layers to a map
### **Marker**s
Before saving the map objects we can add *elements* (objects) to it.

The **map.add_child** function adds various elements (called children) to the map

**Marker**s allow us to add pop-ups and require us to provide a location (a latitude, longitude pair)

In [None]:
map2.add_child(folium.Marker(location= [38.2, -99.1], popup= "I'm a pop-up", icon = folium.Icon(color= "green")))
map2

But this isn't the best way of doing this.

Rather we'd create a **folium.FeatureGroup**, add all the elements to that group, and then just add the feature group to the map. This lets us add elements in an organized manner and keeps the code itself more organized too

**Feature Groups** also allow for creating layers that can be turned on and off as per requirement

In [None]:
map3 = folium.Map(location= [38.58, -99.09], zoom_start= 6, tiles = "Stamen Terrain")
feat_grp = folium.FeatureGroup(name = "Markers")
feat_grp.add_child(folium.Marker(location= [38.2, -99.1], popup= "Hi!, I'm a red popup", icon= folium.Icon(color= "red")))
feat_grp.add_child(folium.Marker(location= [39.2, -99.1], popup= "Hi!, I'm a blue popup", icon= folium.Icon(color= "blue")))

In [None]:
map3.add_child(feat_grp)

All of this implies that for **adding multiple elements** all we need to do is to add all the elements in a common *Feature Group* like we do with any other collection by iterating over all the elements and then add that Feature Group to the *Map Object*

In [None]:
import pandas as pd

In [None]:
data = pd.read_csv("Dataset/Volcanoes.txt", sep= ",")   #No need for the sep arg, I added it just because
data.head()

So now that we have our test data loaded we can see what all it contains and come to the conclusion that we only need the *LAT* and the *LON* attributes along with the *names* (maybe).

In [None]:
Map_multi_mark = folium.Map([42.081797, -103.696771], zoom_start= 4, tiles = "Stamen Terrain")
Map_multi_mark

In [None]:
lat = list(data["LAT"])
lon = list(data["LON"])
name = list(data["NAME"])
ele = list(data["ELEV"])

Now we have the latitudes, longitudes, names and elevations in native python lists

In [None]:
def give_markers_grp():
    Feat_grp = folium.FeatureGroup(name= "Volcanoes Markers")
    for la, lo, na in zip(lat, lon, name):
        Feat_grp.add_child(folium.Marker(location= [la,lo], popup= na, icon= folium.Icon(color= "red")))
    return Feat_grp

In [None]:
Map_multi_mark.add_child(give_markers_grp())
Map_multi_mark.save("Dataset/OutputVolcanoes.html")
Map_multi_mark

## Adding HTML on Popups
Note that if you want to have stylized text (bold, different fonts, etc) in the popup window you can use HTML. Here's an example:

In [None]:
data = pd.read_csv("Dataset/Volcanoes.txt")
lat = list(data["LAT"])
lon = list(data["LON"])
elev = list(data["ELEV"])
 
html = """<h4>Volcano information:</h4>
Height: %s m
"""
 
map_ex = folium.Map(location=[38.58, -99.09], zoom_start=5, tiles = "Stamen Terrain")
fg = folium.FeatureGroup(name = "My Map")
 
for lt, ln, el in zip(lat, lon, elev):
    iframe = folium.IFrame(html=html % str(el), width=200, height=100)
    fg.add_child(folium.Marker(location=[lt, ln], popup=folium.Popup(iframe), icon = folium.Icon(color = "green")))

map_ex.add_child(fg)
map_ex.save("Dataset/OutputMap_html_popup_simple.html")
map_ex

You can even put **links** in the popup window. For example, the code below will produce a popup window with the name of the volcano as a link which does a Google search for that particular volcano when clicked:

In [None]:
data = pd.read_csv("Dataset/Volcanoes.txt")
lat = list(data["LAT"])
lon = list(data["LON"])
elev = list(data["ELEV"])
name = list(data["NAME"])
 
html = """
Volcano name:<br>
<a href="https://www.google.com/search?q=%%22%s%%22" target="_blank">%s</a><br>
Height: %s m
"""
 
map_adv = folium.Map(location=[38.58, -99.09], zoom_start=5, tiles = "Stamen Terrain")
fg = folium.FeatureGroup(name = "My Map")
 
for lt, ln, el, nm in zip(lat, lon, elev, name):   # nm, so the loop doesn't overwrite the name list
    iframe = folium.IFrame(html=html % (nm, nm, el), width=200, height=100)
    fg.add_child(folium.Marker(location=[lt, ln], popup=folium.Popup(iframe), icon = folium.Icon(color = "green")))
 
map_adv.add_child(fg)
map_adv.save("Dataset/OutputMap_html_popup_advanced.html")
map_adv
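
If the names may contain spaces or characters like & and <, it is safer to encode them before putting them into the link and the markup. The sketch below shows one way of doing that with the standard library only; `popup_html` is a hypothetical helper and "Sample Peak" is a made-up name:

```python
from html import escape
from urllib.parse import quote

template = (
    'Volcano name:<br>'
    '<a href="https://www.google.com/search?q={query}" target="_blank">{label}</a><br>'
    'Height: {height} m'
)

def popup_html(name, elevation):
    """Build the popup markup for one volcano (hypothetical helper)."""
    return template.format(
        query=quote(f'"{name}"'),   # URL-encode the quoted search phrase
        label=escape(name),         # escape <, > and & for display
        height=elevation,
    )

print(popup_html("Sample Peak", 1234))
```

The returned string could then be passed to *folium.IFrame* in place of the `html % (...)` expression used above.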

Just for the sake of it, let's generate a color based on elevation

In [None]:
def color_elev(e):
    if e < 1000:
        return 'green'
    elif 1000 <= e < 3000:
        return 'orange'
    else:
        return 'red'
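
If more bands get added later, the same banding can be driven by a lookup table instead of a growing if/elif chain. This is just a sketch of that idea; `BOUNDS`, `COLOURS` and `color_for` are names made up for the example:

```python
from bisect import bisect_right

# Lower edges of the 2nd and 3rd band, and one colour per band
BOUNDS = [1000, 3000]
COLOURS = ["green", "orange", "red"]

def color_for(elevation):
    """Same banding as color_elev, driven by a lookup table."""
    return COLOURS[bisect_right(BOUNDS, elevation)]

for e in (999, 1000, 2999, 3000):
    print(e, color_for(e))
```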

Now let's get the markers looking all nice and good

In [None]:
map_cirly = folium.Map(location=[38.58, -99.09], zoom_start=5, tiles = "Stamen Terrain")


In [None]:

def give_markers_grp_2():
    Feat_grp = folium.FeatureGroup(name= "Volcanoes Markers")
    for la, lo, na, e in zip(lat, lon, name, ele):
        Feat_grp.add_child(folium.CircleMarker(location= [la,lo], radius= 6, popup= na, fill_color = color_elev(e), color = 'grey', fill_opacity = 0.7))
    return Feat_grp

In [None]:
map_cirly.add_child(give_markers_grp_2())
map_cirly

## Map Layers
Currently we have two layers in our *map object* the base map (served to us by OpenStreetMap) and the markers layer. In GIS there are several types of layers, like the point layer (that we have), polygon layer (that we'll add) and line layer (that we won't add), each with their own use-cases.

To add **Polygon layer** via folium we use **folium.GeoJson()** that creates a GeoJson object (GeoJson is a special case of json), which we'll pass to the map as an element as we've done with earlier layers.

In [None]:
feature_grp = give_markers_grp_2()
feature_grp.add_child(folium.GeoJson(data = open("Dataset/world.json", 'r', encoding= 'utf-8-sig').read()))

map_poly = folium.Map(location= [42.081797, -103.696771], zoom_start= 5, tiles = "Stamen Terrain")
map_poly.add_child(feature_grp)

Those polygons on the base map demarcating the countries are the third layer that we added.

Let's now stylize the polygon layer based on the population of the countries *(POP2005)* to obtain something similar to a *population layer*

For this we need to pass another argument, **style_function**, to GeoJson

In [None]:
feature_grp = give_markers_grp_2()
feature_grp.add_child(folium.GeoJson(data = open("Dataset/world.json", 'r', encoding= 'utf-8-sig').read(),
style_function = lambda x : {'fillColor' : 'yellow'}))

map_im_out_of_names = folium.Map(location= [42.081797, -103.696771], zoom_start= 5, tiles = "Stamen Terrain")
map_im_out_of_names.add_child(feature_grp)

In [None]:
feature_grp = give_markers_grp_2()
feature_grp.add_child(folium.GeoJson(data = open("Dataset/world.json", 'r', encoding= 'utf-8-sig').read(),
style_function = lambda x : {'fillColor' : 'green' if x['properties']['POP2005'] < 10000000 
else 'orange' if x['properties']['POP2005'] < 20000000 else 'red'}))


In [None]:
map_im_out_of_names.add_child(feature_grp)
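
The nested conditional inside the lambda gets hard to read quickly. A named function does the same job; the sketch below assumes, as the lambda above does, that every feature carries a *POP2005* property, and it is tested on a hand-made feature rather than on world.json:

```python
def population_style(feature):
    """Return the style options for one GeoJSON feature."""
    population = feature["properties"]["POP2005"]
    if population < 10_000_000:
        fill = "green"
    elif population < 20_000_000:
        fill = "orange"
    else:
        fill = "red"
    return {"fillColor": fill}

# Quick check with a hand-made feature (not taken from world.json)
sample = {"properties": {"POP2005": 15_000_000}}
print(population_style(sample))
```

It can then be handed over as *style_function = population_style* in place of the lambda.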

### Adding layer control panel 
The key thing here is the **folium.LayerControl** object that we must add as an element to our map object.

**!!!** One very important thing is that the *LayerControl* looks for the *feature groups* added to the *map object*, so it must be added to the *map object* after we've already added all the layers (that are to be controlled) to our map.

In [None]:
map_final__promis = folium.Map(location= [42.081797, -103.696771], zoom_start= 6, tiles = "Stamen Terrain")
feature_grp_final1 = give_markers_grp_2()
feature_grp_final12 = folium.FeatureGroup(name = "Population Map")
feature_grp_final12.add_child(folium.GeoJson(data = open("Dataset/world.json", 'r', encoding= 'utf-8-sig').read(),
style_function = lambda x : {'fillColor' : 'green' if x['properties']['POP2005'] < 10000000 
else 'orange' if x['properties']['POP2005'] < 20000000 else 'red'}))
map_final__promis.add_child(feature_grp_final1)
map_final__promis.add_child(feature_grp_final12)
map_final__promis.add_child(folium.LayerControl())   # added last, after all the layers
map_final__promis.save("Dataset/OutputFinal.html")