**PANDAS**

Pandas is a software library written for the Python programming language for **data manipulation and analysis**. In particular, it offers data structures and operations for manipulating numerical tables and time series.

In [61]:
import pandas as pd

The primary data structure for storing tabular data in pandas is the **Data Frame**.

In [62]:
df1 = pd.DataFrame([[2, 4, 6],  [10, 20, 30]])
df1

Unnamed: 0,0,1,2
0,2,4,6
1,10,20,30


The 0, 1, 2 at the top are the **column names** (you can name the columns yourself), and the 0 and 1 on the left are the row indices (which can also be renamed).

In [63]:
df1 = pd.DataFrame([[2, 4, 6],  [10, 20, 30]], columns = ["Column1", "Columns2", "Column3"], index= ["Row1", "Row2"])
df1

Unnamed: 0,Column1,Columns2,Column3
Row1,2,4,6
Row2,10,20,30


See the names?

Custom row labels may not look very useful on such a tiny table, but hey, you can set them if you want to.

Another way of creating Data Frames is to use dictionaries instead of lists, with one dictionary per row.

In [64]:
df2 = pd.DataFrame([{"Last Name" : "Bond", "Name" : "James"}, {"Last Name" : "Stark", "Name" : "Tony"}, {"Last Name" : "Anderson", "Name" : "Thomas"},{"Last Name" : "Holmes", "Name" : "Sherlock"},{"Name" : "Neo"}])
print(type(df2))
df2

<class 'pandas.core.frame.DataFrame'>


Unnamed: 0,Last Name,Name
0,Bond,James
1,Stark,Tony
2,Anderson,Thomas
3,Holmes,Sherlock
4,,Neo


This also goes to show how tedious it is to create data frames manually. We mostly import data from externally created CSVs or other files for analysis and processing with pandas. Notice, too, that a key missing from a dictionary (the last name of Neo) simply becomes a missing value in that row.

Also, notice the data type of df2!

In [65]:
print("The column-wise mean of the data in first data frame")
print(df1.mean())
print("\nAnd the mean of entire data set is")
print(df1.mean().mean())

The column-wise mean of the data in first data frame
Column1      6.0
Columns2    12.0
Column3     18.0
dtype: float64

And the mean of entire data set is
12.0


And this is a (very, very, very layman-level) example of data analysis.
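
The overall mean above is taken as the mean of the column means, which matches the true mean only when every column holds the same number of non-missing values. Here is a minimal sketch of a more direct route, assuming the same small table rebuilt under the illustrative name `demo`:

```python
import pandas as pd

demo = pd.DataFrame(
    [[2, 4, 6], [10, 20, 30]],
    columns=["Column1", "Columns2", "Column3"],
    index=["Row1", "Row2"],
)

# Mean of each column (a Series indexed by column name)
column_means = demo.mean()

# Mean of each row: axis=1 aggregates across the columns
row_means = demo.mean(axis=1)

# Mean of every cell: flatten the frame to a NumPy array first
overall_mean = demo.to_numpy().mean()

print(column_means, row_means, overall_mean, sep="\n")
```

Going through `to_numpy()` is just one option; it assumes every column is numeric.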

In [66]:
type(df1.mean())

pandas.core.series.Series

Well, this is a pandas **Series** object. Series offer more or less the same methods that we can apply to a data frame object.

Data Frames are made up of Series: each column is one. You'll get the gist once you see the code below and its result.

In [67]:
print("Data type of the columns of a Data Frame is",type(df1.Column1))
print("Thus we can find their mean or do other (I don't know what) stuff on them.\nLike the mean of Column1 is", df1.Column1.mean(),"\nAnd the largest element of Column1 is", df1.Column1.max())

Data type of the columns of a Data Frame is <class 'pandas.core.series.Series'>
Thus we can find their mean or do other (I don't know what) stuff on them.
Like the mean of Column1 is 6.0 
And the largest element of Column1 is 10
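
Attribute access such as `df1.Column1` only works when the column name is a valid Python identifier. A small sketch of the bracket form, which works for any name, follows; the `people` frame and its values are made up for illustration:

```python
import pandas as pd

people = pd.DataFrame({"First Name": ["James", "Tony"], "Age": [40, 45]})

# Bracket access works for any column name, including ones with spaces
first_names = people["First Name"]

# Attribute access needs a name that is a valid Python identifier
ages = people.Age

print(type(first_names).__name__)
print(ages.max(), ages.min(), ages.sum())
```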


Working with **csv's** and other file formats

In [68]:
df1 = pd.read_csv("Dataset/supermarkets.csv")   # forward slashes work on every OS and avoid backslash escapes
df1

Unnamed: 0,ID,Address,City,State,Country,Name,Employees
0,1,3666 21st St,San Francisco,CA 94114,USA,Madeira,8
1,2,735 Dolores St,San Francisco,CA 94119,USA,Bready Shop,15
2,3,332 Hill St,San Francisco,California 94114,USA,Super River,25
3,4,3995 23rd St,San Francisco,CA 94114,USA,Ben's Shop,10
4,5,1056 Sanchez St,San Francisco,California,USA,Sanchez,12
5,6,551 Alvarado St,San Francisco,CA 94114,USA,Richvalley,20


Wasn't that easy and simple!
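
Backslashes inside ordinary string literals can be read as escape sequences, so building paths with forward slashes or `pathlib` tends to be safer. The sketch below writes a tiny made-up table to a temporary folder and reads it back; the file name `shops.csv` is hypothetical:

```python
import tempfile
from pathlib import Path

import pandas as pd

shops = pd.DataFrame({"ID": [1, 2], "Name": ["Madeira", "Bready Shop"]})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "Dataset" / "shops.csv"  # hypothetical file name
    csv_path.parent.mkdir()
    # index=False keeps the row labels out of the file
    shops.to_csv(csv_path, index=False)
    loaded = pd.read_csv(csv_path)

print(loaded)
```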

In [69]:
df2 = pd.read_json("Dataset/supermarkets.json")
df2

Unnamed: 0,ID,Address,City,State,Country,Name,Employees
0,1,3666 21st St,San Francisco,CA 94114,USA,Madeira,8
1,2,735 Dolores St,San Francisco,CA 94119,USA,Bready Shop,15
2,3,332 Hill St,San Francisco,California 94114,USA,Super River,25
3,4,3995 23rd St,San Francisco,CA 94114,USA,Ben's Shop,10
4,5,1056 Sanchez St,San Francisco,California,USA,Sanchez,12
5,6,551 Alvarado St,San Francisco,CA 94114,USA,Richvalley,20


Loading **Excel** files

In [70]:
df3 = pd.read_excel("Dataset/supermarkets.xlsx", sheet_name = 0)
df3

Unnamed: 0,ID,Address,City,State,Country,Supermarket Name,Number of Employees
0,1,3666 21st St,San Francisco,CA 94114,USA,Madeira,8
1,2,735 Dolores St,San Francisco,CA 94119,USA,Bready Shop,15
2,3,332 Hill St,San Francisco,California 94114,USA,Super River,25
3,4,3995 23rd St,San Francisco,CA 94114,USA,Ben's Shop,10
4,5,1056 Sanchez St,San Francisco,California,USA,Sanchez,12
5,6,551 Alvarado St,San Francisco,CA 94114,USA,Richvalley,20


From **plain txt** files (separated by commas) we also use *pandas.read_csv()*.

Despite its name, read_csv handles any *character separated* file, not just comma separated ones. For commas we don't need to pass any **separator** argument, since the comma is the default, but for other separators we need to pass the character that is used as the separator.

In [71]:
df4 = pd.read_csv("Dataset/supermarkets-commas.txt")
df4

Unnamed: 0,ID,Address,City,State,Country,Name,Employees
0,1,3666 21st St,San Francisco,CA 94114,USA,Madeira,8
1,2,735 Dolores St,San Francisco,CA 94119,USA,Bready Shop,15
2,3,332 Hill St,San Francisco,California 94114,USA,Super River,25
3,4,3995 23rd St,San Francisco,CA 94114,USA,Ben's Shop,10
4,5,1056 Sanchez St,San Francisco,California,USA,Sanchez,12
5,6,551 Alvarado St,San Francisco,CA 94114,USA,Richvalley,20


Below we can check what happens if the file uses something other than a comma as the separator and we don't tell pandas about it.

In [72]:
df4 = pd.read_csv("Dataset/supermarkets-semi-colons.txt")
df4

Unnamed: 0,ID;Address;City;State;Country;Name;Employees
0,1;3666 21st St;San Francisco;CA 94114;USA;Made...
1,2;735 Dolores St;San Francisco;CA 94119;USA;Br...
2,3;332 Hill St;San Francisco;California 94114;U...
3,4;3995 23rd St;San Francisco;CA 94114;USA;Ben'...
4,5;1056 Sanchez St;San Francisco;California;USA...
5,6;551 Alvarado St;San Francisco;CA 94114;USA;R...


Well, that came out badly. It didn't recognize the separator, so every row ended up in a single column.

So what do we do? We pass the separator explicitly.

In [73]:
df5 = pd.read_csv("Dataset/supermarkets-semi-colons.txt", sep = ';')
df5

Unnamed: 0,ID,Address,City,State,Country,Name,Employees
0,1,3666 21st St,San Francisco,CA 94114,USA,Madeira,8
1,2,735 Dolores St,San Francisco,CA 94119,USA,Bready Shop,15
2,3,332 Hill St,San Francisco,California 94114,USA,Super River,25
3,4,3995 23rd St,San Francisco,CA 94114,USA,Ben's Shop,10
4,5,1056 Sanchez St,San Francisco,California,USA,Sanchez,12
5,6,551 Alvarado St,San Francisco,CA 94114,USA,Richvalley,20


It worked!
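
The same idea applies to any single-character separator. Here is a self-contained sketch that uses an in-memory string in place of a file on disk; the pipe-separated sample data is made up:

```python
import io

import pandas as pd

# In-memory stand-in for a pipe-separated text file (made-up sample data)
raw = io.StringIO("ID|Name|Employees\n1|Madeira|8\n2|Bready Shop|15\n")

# sep tells read_csv which character splits the fields
shops = pd.read_csv(raw, sep="|")

print(shops)
```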

Setting table **Header Row**

When data is imported, the first row is treated as the header row by default, so if the file has no header row we get something like this:

In [74]:
df6 = pd.read_csv("Dataset/supermarkets_noH.csv")
print(df6.sum())
df6

1                                                               20
3666 21st St     735 Dolores St332 Hill St3995 23rd St1056 Sanc...
San Francisco    San FranciscoSan FranciscoSan FranciscoSan Fra...
CA 94114         CA 94119California 94114CA 94114CaliforniaCA 9...
USA                                                USAUSAUSAUSAUSA
Madeira          Bready ShopSuper RiverBen's ShopSanchezRichvalley
8                                                               82
dtype: object


Unnamed: 0,1,3666 21st St,San Francisco,CA 94114,USA,Madeira,8
0,2,735 Dolores St,San Francisco,CA 94119,USA,Bready Shop,15
1,3,332 Hill St,San Francisco,California 94114,USA,Super River,25
2,4,3995 23rd St,San Francisco,CA 94114,USA,Ben's Shop,10
3,5,1056 Sanchez St,San Francisco,California,USA,Sanchez,12
4,6,551 Alvarado St,San Francisco,CA 94114,USA,Richvalley,20


Well, what the heck! The first data row was used as the header, so it is missing from both the table and the sums.

So we need to explicitly tell pandas that our data lacks a header row by setting the **header** parameter as ***header = None***

In [75]:
df7 = pd.read_csv("Dataset/supermarkets_noH.csv", header = None)
print(df7.sum(axis = 1, numeric_only = True))    #row-wise, numeric columns only (recent pandas won't add int and str)
print(df7.sum(axis = 0))    #column-wise
df7

0     9
1    17
2    28
3    14
4    17
5    26
dtype: int64
0                                                   21
1    3666 21st St735 Dolores St332 Hill St3995 23rd...
2    San FranciscoSan FranciscoSan FranciscoSan Fra...
3    CA 94114CA 94119California 94114CA 94114Califo...
4                                   USAUSAUSAUSAUSAUSA
5    MadeiraBready ShopSuper RiverBen's ShopSanchez...
6                                                   90
dtype: object


Unnamed: 0,0,1,2,3,4,5,6
0,1,3666 21st St,San Francisco,CA 94114,USA,Madeira,8
1,2,735 Dolores St,San Francisco,CA 94119,USA,Bready Shop,15
2,3,332 Hill St,San Francisco,California 94114,USA,Super River,25
3,4,3995 23rd St,San Francisco,CA 94114,USA,Ben's Shop,10
4,5,1056 Sanchez St,San Francisco,California,USA,Sanchez,12
5,6,551 Alvarado St,San Francisco,CA 94114,USA,Richvalley,20


So this tells pandas that we don't have a header, and it generates a generic one made up of integer positions.

We can set the column and row names as we did earlier, or like this:

In [76]:
df7.columns = ["ID", "Address", "City", "ZIP", "Country", "Name", "Employees"]
df7

Unnamed: 0,ID,Address,City,ZIP,Country,Name,Employees
0,1,3666 21st St,San Francisco,CA 94114,USA,Madeira,8
1,2,735 Dolores St,San Francisco,CA 94119,USA,Bready Shop,15
2,3,332 Hill St,San Francisco,California 94114,USA,Super River,25
3,4,3995 23rd St,San Francisco,CA 94114,USA,Ben's Shop,10
4,5,1056 Sanchez St,San Francisco,California,USA,Sanchez,12
5,6,551 Alvarado St,San Francisco,CA 94114,USA,Richvalley,20
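
The two steps (reading without a header, then naming the columns) can also be combined through the `names` parameter of `read_csv`. A minimal sketch on made-up in-memory data:

```python
import io

import pandas as pd

# Two header-less rows standing in for a file without a header line
raw = io.StringIO("1,3666 21st St,8\n2,735 Dolores St,15\n")

# header=None: the first line is data; names supplies the column labels
shops = pd.read_csv(raw, header=None, names=["ID", "Address", "Employees"])

print(shops)
```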


We may want to use one attribute of the data as the row index, like *ID* in the data frame above.

We can do this using the ***dataframe.set_index("name")*** method, which *returns a **new*** dataframe.

In [77]:
df7.set_index("ID")

ID,Address,City,ZIP,Country,Name,Employees
1,3666 21st St,San Francisco,CA 94114,USA,Madeira,8
2,735 Dolores St,San Francisco,CA 94119,USA,Bready Shop,15
3,332 Hill St,San Francisco,California 94114,USA,Super River,25
4,3995 23rd St,San Francisco,CA 94114,USA,Ben's Shop,10
5,1056 Sanchez St,San Francisco,California,USA,Sanchez,12
6,551 Alvarado St,San Francisco,CA 94114,USA,Richvalley,20


The initial df7 is still unchanged, so we need to either save the output of set_index in a variable or set the **inplace** parameter of set_index to **True**.

In [78]:
df8 = df7.set_index("ID")
df7

Unnamed: 0,ID,Address,City,ZIP,Country,Name,Employees
0,1,3666 21st St,San Francisco,CA 94114,USA,Madeira,8
1,2,735 Dolores St,San Francisco,CA 94119,USA,Bready Shop,15
2,3,332 Hill St,San Francisco,California 94114,USA,Super River,25
3,4,3995 23rd St,San Francisco,CA 94114,USA,Ben's Shop,10
4,5,1056 Sanchez St,San Francisco,California,USA,Sanchez,12
5,6,551 Alvarado St,San Francisco,CA 94114,USA,Richvalley,20


**But a word of caution**

In [79]:
df8.set_index("Address", inplace= True)

Now when we change the index (here we set it to Address), the new index is set, but the old index (ID) is not turned back into a column; it is simply dropped.

In [80]:
df8

Address,City,ZIP,Country,Name,Employees
3666 21st St,San Francisco,CA 94114,USA,Madeira,8
735 Dolores St,San Francisco,CA 94119,USA,Bready Shop,15
332 Hill St,San Francisco,California 94114,USA,Super River,25
3995 23rd St,San Francisco,CA 94114,USA,Ben's Shop,10
1056 Sanchez St,San Francisco,California,USA,Sanchez,12
551 Alvarado St,San Francisco,CA 94114,USA,Richvalley,20


To tackle this, we can set the **drop** parameter to ***False***, so that the attribute we set as the index also remains as a column. If we change the index again later, we'll still have the data that was used for indexing.

In [81]:
df8.set_index("Name", inplace= True, drop= False)

See that Name has become an index but it will also remain as a column

In [82]:
df8

Name,City,ZIP,Country,Name,Employees
Madeira,San Francisco,CA 94114,USA,Madeira,8
Bready Shop,San Francisco,CA 94119,USA,Bready Shop,15
Super River,San Francisco,California 94114,USA,Super River,25
Ben's Shop,San Francisco,CA 94114,USA,Ben's Shop,10
Sanchez,San Francisco,California,USA,Sanchez,12
Richvalley,San Francisco,CA 94114,USA,Richvalley,20
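
Another way to avoid losing the old index is to move it back into a column with `reset_index()` before setting a new one. A small sketch on a made-up `shops` frame:

```python
import pandas as pd

shops = pd.DataFrame(
    {"ID": [1, 2], "Name": ["Madeira", "Sanchez"], "Employees": [8, 12]}
)

# set_index returns a new frame; shops itself is unchanged
by_id = shops.set_index("ID")

# reset_index moves the index (ID) back into an ordinary column
restored = by_id.reset_index()

# Chaining both switches the index without throwing ID away
by_name = by_id.reset_index().set_index("Name")

print(by_name)
```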


**FILTERING DATA**  Indexing and extracting data

We'll use the df7 data frame to work with.

In [83]:
df7

Unnamed: 0,ID,Address,City,ZIP,Country,Name,Employees
0,1,3666 21st St,San Francisco,CA 94114,USA,Madeira,8
1,2,735 Dolores St,San Francisco,CA 94119,USA,Bready Shop,15
2,3,332 Hill St,San Francisco,California 94114,USA,Super River,25
3,4,3995 23rd St,San Francisco,CA 94114,USA,Ben's Shop,10
4,5,1056 Sanchez St,San Francisco,California,USA,Sanchez,12
5,6,551 Alvarado St,San Francisco,CA 94114,USA,Richvalley,20


In [84]:
data = df7.set_index("Address", drop=True)
data

Address,ID,City,ZIP,Country,Name,Employees
3666 21st St,1,San Francisco,CA 94114,USA,Madeira,8
735 Dolores St,2,San Francisco,CA 94119,USA,Bready Shop,15
332 Hill St,3,San Francisco,California 94114,USA,Super River,25
3995 23rd St,4,San Francisco,CA 94114,USA,Ben's Shop,10
1056 Sanchez St,5,San Francisco,California,USA,Sanchez,12
551 Alvarado St,6,San Francisco,CA 94114,USA,Richvalley,20


We can use *label based* or *position based* indexing.

**Label based indexing** is when we use the row and column labels to address them. Note that both ends of a label slice are included.

In [85]:
data.loc["735 Dolores St" : "3995 23rd St", "Country" : "Name"]

Address,Country,Name
735 Dolores St,USA,Bready Shop
332 Hill St,USA,Super River
3995 23rd St,USA,Ben's Shop


And a specific element using

In [86]:
data.loc["332 Hill St", "Country"]

'USA'

And for all the rows of a single column

In [87]:
list(data.loc[:, "Country"])   # Python's list function converts the resulting Series into a list

['USA', 'USA', 'USA', 'USA', 'USA', 'USA']

**Position based indexing** is when we address rows and columns by their integer positions, counting from 0.

In [88]:
data

Address,ID,City,ZIP,Country,Name,Employees
3666 21st St,1,San Francisco,CA 94114,USA,Madeira,8
735 Dolores St,2,San Francisco,CA 94119,USA,Bready Shop,15
332 Hill St,3,San Francisco,California 94114,USA,Super River,25
3995 23rd St,4,San Francisco,CA 94114,USA,Ben's Shop,10
1056 Sanchez St,5,San Francisco,California,USA,Sanchez,12
551 Alvarado St,6,San Francisco,CA 94114,USA,Richvalley,20


In [89]:
data.iloc[1:3, 1:3]   #and as usual it is upper-bound exclusive

Address,City,ZIP
735 Dolores St,San Francisco,CA 94119
332 Hill St,San Francisco,California 94114


In [90]:
data.iloc[:, 1:3]

Address,City,ZIP
3666 21st St,San Francisco,CA 94114
735 Dolores St,San Francisco,CA 94119
332 Hill St,San Francisco,California 94114
3995 23rd St,San Francisco,CA 94114
1056 Sanchez St,San Francisco,California
551 Alvarado St,San Francisco,CA 94114


In [91]:
data.iloc[3, 1:3]

City    San Francisco
ZIP          CA 94114
Name: 3995 23rd St, dtype: object
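
The main difference between the two styles is how slices end: `loc` includes the end label, while `iloc` excludes the end position. A compact sketch on a made-up frame with the row labels a to d:

```python
import pandas as pd

shops = pd.DataFrame(
    {"City": ["SF", "SF", "SF", "SF"], "Employees": [8, 15, 25, 10]},
    index=["a", "b", "c", "d"],
)

# Label slice: rows a, b and c (the end label is included)
by_label = shops.loc["a":"c"]

# Position slice: positions 0 and 1 only (the end position is excluded)
by_position = shops.iloc[0:2]

# Single cells, once by label and once by position
cell_by_label = shops.loc["b", "Employees"]
cell_by_position = shops.iloc[1, 1]

print(len(by_label), len(by_position), cell_by_label, cell_by_position)
```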

We can also get rid of columns and rows of data frames (though these operations are not in place, i.e., they create a new instance rather than modifying the object on which they were invoked).

In [92]:
data

Address,ID,City,ZIP,Country,Name,Employees
3666 21st St,1,San Francisco,CA 94114,USA,Madeira,8
735 Dolores St,2,San Francisco,CA 94119,USA,Bready Shop,15
332 Hill St,3,San Francisco,California 94114,USA,Super River,25
3995 23rd St,4,San Francisco,CA 94114,USA,Ben's Shop,10
1056 Sanchez St,5,San Francisco,California,USA,Sanchez,12
551 Alvarado St,6,San Francisco,CA 94114,USA,Richvalley,20


The **axis** argument of the ***dataframe.drop()*** function tells whether we want to delete columns (axis = 1) or rows (axis = 0).

In [93]:
data.drop("City", axis = 1)   # axis must be passed by keyword in recent pandas versions

Address,ID,ZIP,Country,Name,Employees
3666 21st St,1,CA 94114,USA,Madeira,8
735 Dolores St,2,CA 94119,USA,Bready Shop,15
332 Hill St,3,California 94114,USA,Super River,25
3995 23rd St,4,CA 94114,USA,Ben's Shop,10
1056 Sanchez St,5,California,USA,Sanchez,12
551 Alvarado St,6,CA 94114,USA,Richvalley,20


**axis = 0** removes rows

In [94]:
data.drop("332 Hill St", axis = 0)

Address,ID,City,ZIP,Country,Name,Employees
3666 21st St,1,San Francisco,CA 94114,USA,Madeira,8
735 Dolores St,2,San Francisco,CA 94119,USA,Bready Shop,15
3995 23rd St,4,San Francisco,CA 94114,USA,Ben's Shop,10
1056 Sanchez St,5,San Francisco,California,USA,Sanchez,12
551 Alvarado St,6,San Francisco,CA 94114,USA,Richvalley,20


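To drop rows or columns by position, we use a trick: slice ***dataframe.index*** or ***dataframe.columns*** to get the labels at those positions and pass them to drop.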

Example for rows is below

In [95]:
data.drop(data.index[0:3], axis = 0)

Address,ID,City,ZIP,Country,Name,Employees
3995 23rd St,4,San Francisco,CA 94114,USA,Ben's Shop,10
1056 Sanchez St,5,San Francisco,California,USA,Sanchez,12
551 Alvarado St,6,San Francisco,CA 94114,USA,Richvalley,20


Similarly for columns

In [96]:
data.drop(data.columns[0:3], axis = 1)

Address,Country,Name,Employees
3666 21st St,USA,Madeira,8
735 Dolores St,USA,Bready Shop,15
332 Hill St,USA,Super River,25
3995 23rd St,USA,Ben's Shop,10
1056 Sanchez St,USA,Sanchez,12
551 Alvarado St,USA,Richvalley,20
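
`drop` also accepts the keywords `columns=` and `index=`, which say what is being removed without an axis number. A short sketch on a made-up frame:

```python
import pandas as pd

shops = pd.DataFrame(
    {
        "City": ["San Francisco"] * 3,
        "Name": ["Madeira", "Super River", "Sanchez"],
        "Employees": [8, 25, 12],
    },
    index=["3666 21st St", "332 Hill St", "1056 Sanchez St"],
)

# Remove a column by name
no_city = shops.drop(columns="City")

# Remove a row by its label
no_hill = shops.drop(index="332 Hill St")

# Remove the first two rows by position, via their labels
last_only = shops.drop(index=shops.index[:2])

print(no_city.columns.tolist(), no_hill.index.tolist(), last_only.index.tolist())
```

As with the calls above, each of these returns a new frame and leaves `shops` untouched.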


***dataframe.index*** returns the row labels and ***dataframe.columns*** returns the column names; both are pandas **Index** objects rather than plain lists.

In [97]:
print(data.index,"\n")
print(data.columns)

Index(['3666 21st St', '735 Dolores St', '332 Hill St', '3995 23rd St',
       '1056 Sanchez St', '551 Alvarado St'],
      dtype='object', name='Address') 

Index(['ID', 'City', 'ZIP', 'Country', 'Name', 'Employees'], dtype='object')
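
Since both attributes are `Index` objects, they can be sliced, tested for membership, or converted to plain lists. A brief sketch on a made-up frame:

```python
import pandas as pd

shops = pd.DataFrame(
    {"Name": ["Madeira", "Sanchez"], "Employees": [8, 12]},
    index=pd.Index(["3666 21st St", "1056 Sanchez St"], name="Address"),
)

row_labels = shops.index
col_labels = shops.columns

print(type(row_labels).__name__, type(col_labels).__name__)
print(row_labels.tolist())      # plain Python list of the row labels
print("Name" in col_labels)     # membership test on the column names
```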


**Updating and adding** new columns and rows

When adding a new column to a data frame, we declare the column name and assign it a list with one value per row (so the number of values in the list must equal the number of rows in the data frame).

In [98]:
data["Continent"] = ["North America"] * data.shape[0]    #Or just multiply by len(data.index)
data

Address,ID,City,ZIP,Country,Name,Employees,Continent
3666 21st St,1,San Francisco,CA 94114,USA,Madeira,8,North America
735 Dolores St,2,San Francisco,CA 94119,USA,Bready Shop,15,North America
332 Hill St,3,San Francisco,California 94114,USA,Super River,25,North America
3995 23rd St,4,San Francisco,CA 94114,USA,Ben's Shop,10,North America
1056 Sanchez St,5,San Francisco,California,USA,Sanchez,12,North America
551 Alvarado St,6,San Francisco,CA 94114,USA,Richvalley,20,North America


Modifying a column

In [99]:
data["Continent"] = data["Country"] + ", " + "North America"
data

Address,ID,City,ZIP,Country,Name,Employees,Continent
3666 21st St,1,San Francisco,CA 94114,USA,Madeira,8,"USA, North America"
735 Dolores St,2,San Francisco,CA 94119,USA,Bready Shop,15,"USA, North America"
332 Hill St,3,San Francisco,California 94114,USA,Super River,25,"USA, North America"
3995 23rd St,4,San Francisco,CA 94114,USA,Ben's Shop,10,"USA, North America"
1056 Sanchez St,5,San Francisco,California,USA,Sanchez,12,"USA, North America"
551 Alvarado St,6,San Francisco,CA 94114,USA,Richvalley,20,"USA, North America"
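
When every row gets the same value, a single scalar is enough, because pandas repeats it for each row. The sketch below, on a made-up frame, also shows a column derived from another one and `assign`, which returns a new frame instead of modifying the original:

```python
import pandas as pd

shops = pd.DataFrame({"Name": ["Madeira", "Sanchez"], "Employees": [8, 12]})

# A scalar is broadcast to every row
shops["Continent"] = "North America"

# A column computed element-wise from an existing one
shops["Large"] = shops["Employees"] > 10

# assign returns a new frame with the extra column; shops is left as it is
with_label = shops.assign(Label=shops["Name"] + " (" + shops["Continent"] + ")")

print(with_label)
```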


**Adding a row**

This ain't as easy as adding a column. One way of doing this is to **transpose** the data frame, add a column like we did previously, and then transpose it back.

In [100]:
data_temp = data.T
data_temp

Address,3666 21st St,735 Dolores St,332 Hill St,3995 23rd St,1056 Sanchez St,551 Alvarado St
ID,1,2,3,4,5,6
City,San Francisco,San Francisco,San Francisco,San Francisco,San Francisco,San Francisco
ZIP,CA 94114,CA 94119,California 94114,CA 94114,California,CA 94114
Country,USA,USA,USA,USA,USA,USA
Name,Madeira,Bready Shop,Super River,Ben's Shop,Sanchez,Richvalley
Employees,8,15,25,10,12,20
Continent,"USA, North America","USA, North America","USA, North America","USA, North America","USA, North America","USA, North America"


In [101]:
data_temp["My Address"] = [7, "My City", "Myy ZIP code", "My Country", "My Name", 67, "My Continent"]
data = data_temp.T
data

Address,ID,City,ZIP,Country,Name,Employees,Continent
3666 21st St,1,San Francisco,CA 94114,USA,Madeira,8,"USA, North America"
735 Dolores St,2,San Francisco,CA 94119,USA,Bready Shop,15,"USA, North America"
332 Hill St,3,San Francisco,California 94114,USA,Super River,25,"USA, North America"
3995 23rd St,4,San Francisco,CA 94114,USA,Ben's Shop,10,"USA, North America"
1056 Sanchez St,5,San Francisco,California,USA,Sanchez,12,"USA, North America"
551 Alvarado St,6,San Francisco,CA 94114,USA,Richvalley,20,"USA, North America"
My Address,7,My City,Myy ZIP code,My Country,My Name,67,My Continent


Aaaaand we successfully added a row to our data frame

And to update the data of a row, we can do a similar thing.

In [102]:
data = data.T
data["My Address"] = [7, "Varanasi", "VNS 221003", "India", "Hattori", "1", "Asia"]
data = data.T
data

Address,ID,City,ZIP,Country,Name,Employees,Continent
3666 21st St,1,San Francisco,CA 94114,USA,Madeira,8,"USA, North America"
735 Dolores St,2,San Francisco,CA 94119,USA,Bready Shop,15,"USA, North America"
332 Hill St,3,San Francisco,California 94114,USA,Super River,25,"USA, North America"
3995 23rd St,4,San Francisco,CA 94114,USA,Ben's Shop,10,"USA, North America"
1056 Sanchez St,5,San Francisco,California,USA,Sanchez,12,"USA, North America"
551 Alvarado St,6,San Francisco,CA 94114,USA,Richvalley,20,"USA, North America"
My Address,7,Varanasi,VNS 221003,India,Hattori,1,Asia
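
As an alternative to transposing twice, assigning to a label that does not exist yet through `.loc` appends a row, and `pd.concat` stacks whole frames. This is a sketch on made-up data, not a replacement for the approach above:

```python
import pandas as pd

shops = pd.DataFrame(
    {"City": ["San Francisco"], "Employees": [8]},
    index=pd.Index(["3666 21st St"], name="Address"),
)

# Assigning to a new label adds a row (one value per column)
shops.loc["My Address"] = ["My City", 67]

# Assigning to an existing label and column updates a single cell
shops.loc["My Address", "Employees"] = 1

# pd.concat appends the rows of another frame with the same columns
extra = pd.DataFrame(
    {"City": ["Varanasi"], "Employees": [5]},
    index=pd.Index(["Another Address"], name="Address"),
)
combined = pd.concat([shops, extra])

print(combined)
```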


**An example for data analysis**

Some tidbits: The process of converting addresses into latitudes and longitudes is called **Geocoding**, and the process of converting the latitude and longitude of a place into an address is called **Reverse geocoding**.

In [103]:
from geopy.geocoders import ArcGIS
nom = ArcGIS()

In [106]:
n = nom.geocode("Nav Sadhana Kendra, Varanasi, 221003, Uttar Pradesh")
print(type(n))
n

<class 'geopy.location.Location'>


Location(Nava Sadhana Kala Kendra, (25.372200000000078, 82.92787000000004, 0.0))

Oh wow, it works!

Don't worry about the 0.0 at the end. It is the altitude field of the location, which the geocoder leaves at zero here. The geocoder may also return a None object when it cannot find the location (oooorr... it's top secret).

Also, isn't the type of the result interesting? It is a special object, the Location object of geopy. Let's extract the latitude and longitude data from it.

In [107]:
print(n.latitude, n.longitude)

25.372200000000078 82.92787000000004


Now let's convert an entire dataframe of addresses into latitudes and longitudes and then add two columns of the extracted information to the data frame.

In [109]:
data = pd.read_csv("Dataset/supermarkets.csv")
data

Unnamed: 0,ID,Address,City,State,Country,Name,Employees
0,1,3666 21st St,San Francisco,CA 94114,USA,Madeira,8
1,2,735 Dolores St,San Francisco,CA 94119,USA,Bready Shop,15
2,3,332 Hill St,San Francisco,California 94114,USA,Super River,25
3,4,3995 23rd St,San Francisco,CA 94114,USA,Ben's Shop,10
4,5,1056 Sanchez St,San Francisco,California,USA,Sanchez,12
5,6,551 Alvarado St,San Francisco,CA 94114,USA,Richvalley,20


Well, the geocoder expects a single address string as input, usually of the form "road, city, zip code, country", and not a dataframe. So, let's modify the Address column to meet this requirement and then pass the data to the geocoder.

In [111]:
data["Address"] = data["Address"] + ", " + data["City"] + ", " + data["State"] + ", " + data["Country"]
data

Unnamed: 0,ID,Address,City,State,Country,Name,Employees
0,1,"3666 21st St, San Francisco, CA 94114, USA, Sa...",San Francisco,CA 94114,USA,Madeira,8
1,2,"735 Dolores St, San Francisco, CA 94119, USA, ...",San Francisco,CA 94119,USA,Bready Shop,15
2,3,"332 Hill St, San Francisco, California 94114, ...",San Francisco,California 94114,USA,Super River,25
3,4,"3995 23rd St, San Francisco, CA 94114, USA, Sa...",San Francisco,CA 94114,USA,Ben's Shop,10
4,5,"1056 Sanchez St, San Francisco, California, US...",San Francisco,California,USA,Sanchez,12
5,6,"551 Alvarado St, San Francisco, CA 94114, USA,...",San Francisco,CA 94114,USA,Richvalley,20


You might be thinking of iterating, but with pandas you don't need to: a Series has an **apply** method that calls a function on each of its elements. So we create a new column and assign it the output of data["column to apply on"].**apply(function name)**

In [112]:
data["Coordinates"] = data["Address"].apply(nom.geocode)
data

Unnamed: 0,ID,Address,City,State,Country,Name,Employees,Coordinates
0,1,"3666 21st St, San Francisco, CA 94114, USA, Sa...",San Francisco,CA 94114,USA,Madeira,8,"(3666 21st St, San Francisco, California, 9411..."
1,2,"735 Dolores St, San Francisco, CA 94119, USA, ...",San Francisco,CA 94119,USA,Bready Shop,15,"(735 Dolores St, San Francisco, California, 94..."
2,3,"332 Hill St, San Francisco, California 94114, ...",San Francisco,California 94114,USA,Super River,25,"(332 Hill St, San Francisco, California, 94114..."
3,4,"3995 23rd St, San Francisco, CA 94114, USA, Sa...",San Francisco,CA 94114,USA,Ben's Shop,10,"(3995 23rd St, San Francisco, California, 9411..."
4,5,"1056 Sanchez St, San Francisco, California, US...",San Francisco,California,USA,Sanchez,12,"(1056 Sanchez St, San Francisco, California, 9..."
5,6,"551 Alvarado St, San Francisco, CA 94114, USA,...",San Francisco,CA 94114,USA,Richvalley,20,"(551 Alvarado St, San Francisco, California, 9..."


Now the Coordinates column of the dataframe contains the Location objects corresponding to the addresses.

To add two more columns that store the latitudes and longitudes separately, we'll use the **apply** method in conjunction with a lambda expression.

In [115]:
# The conditional guards against None, which the geocoder returns for an address it cannot resolve
data["Latitude"] = data["Coordinates"].apply(lambda x : x.latitude if x is not None else None)
data["Longitude"] = data["Coordinates"].apply(lambda x : x.longitude if x is not None else None)
data

Unnamed: 0,ID,Address,City,State,Country,Name,Employees,Coordinates,Latitude,Longitude
0,1,"3666 21st St, San Francisco, CA 94114, USA, Sa...",San Francisco,CA 94114,USA,Madeira,8,"(3666 21st St, San Francisco, California, 9411...",37.756632,-122.429411
1,2,"735 Dolores St, San Francisco, CA 94119, USA, ...",San Francisco,CA 94119,USA,Bready Shop,15,"(735 Dolores St, San Francisco, California, 94...",37.757829,-122.425373
2,3,"332 Hill St, San Francisco, California 94114, ...",San Francisco,California 94114,USA,Super River,25,"(332 Hill St, San Francisco, California, 94114...",37.755845,-122.428813
3,4,"3995 23rd St, San Francisco, CA 94114, USA, Sa...",San Francisco,CA 94114,USA,Ben's Shop,10,"(3995 23rd St, San Francisco, California, 9411...",37.752921,-122.431703
4,5,"1056 Sanchez St, San Francisco, California, US...",San Francisco,California,USA,Sanchez,12,"(1056 Sanchez St, San Francisco, California, 9...",37.752132,-122.430006
5,6,"551 Alvarado St, San Francisco, CA 94114, USA,...",San Francisco,CA 94114,USA,Richvalley,20,"(551 Alvarado St, San Francisco, California, 9...",37.753582,-122.433234
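
The apply pattern can be tried without a network connection by swapping in a stand-in geocoder. In the sketch below, `FakeLocation` and `fake_geocode` are hypothetical helpers invented for illustration (they are not part of geopy), and the coordinates are rounded sample values:

```python
from dataclasses import dataclass

import pandas as pd


@dataclass(frozen=True)
class FakeLocation:
    """Hypothetical stand-in for a geocoder result with two attributes."""
    latitude: float
    longitude: float


def fake_geocode(address):
    """Hypothetical offline geocoder: knows one address, returns None otherwise."""
    known = {"3666 21st St": FakeLocation(37.7566, -122.4294)}
    return known.get(address)


places = pd.DataFrame({"Address": ["3666 21st St", "Nowhere Lane"]})

# apply calls the function once per element of the column
places["Coordinates"] = places["Address"].apply(fake_geocode)

# Guard against None so an unresolved address does not raise an AttributeError
places["Latitude"] = places["Coordinates"].apply(
    lambda loc: loc.latitude if loc is not None else None
)
places["Longitude"] = places["Coordinates"].apply(
    lambda loc: loc.longitude if loc is not None else None
)

print(places[["Address", "Latitude", "Longitude"]])
```

The unresolved address ends up with missing values in both new columns rather than stopping the whole run.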
