# Data Analysis Homework 1: Pandas and Numpy

* Create a new Jupyter Notebook. Import all necessary libraries.
* Write a brief summary of your findings. Add comments and Markdown cells in your Jupyter Notebook to explain your code and results. (10 points) 

In [32]:
import pandas as pd
import numpy as np
import math

* The cell above imports pandas as pd and numpy as np; math is imported for the square root used in Q1.
* The rest of this assignment uses the shorthand pd and np when referring to pandas and numpy, respectively.

Q1(30 point): Implement a class for n-sided polygons and a class for points in a Euclidean system, namely polygon and point respectively. For example, a 4-sided polygon can be defined by 4 points P1, P2, P3, P4, and P1-P4 are each points of the form point(X,Y), and X and Y are coordinates on the X and Y axis, respectively. The edges are listed counterclockwise starting at the lower left: P1 to P2, P2 to P3, P3 to P4, and P4 to P1. The polygon class should work for polygons of any number of edges and have a function perimeter that returns its perimeter (sum of the lengths of the edges). (20points)

* Hint: use the Pythagorean theorem: if a line segment Z starts at (X1,Y1) and ends at (X2,Y2), the length of Z is the square root of (X1-X2)^2 + (Y1-Y2)^2.

* Example: The perimeter of the polygon/triangle on point(1,1), point(1,2), and point(2,2) is 3.4; The perimeter of the 4-sided polygon on point(2,1), point(2,3), point(6,3), and point(4,1) is 10.8; print out these two examples. (10points)

## Implementing a class for points, namely point.

In [33]:

class Point:  # class for points in the plane
    def __init__(self, x, y):
        self.x = x  # coordinate on the X axis
        self.y = y  # coordinate on the Y axis

## Implementing a class for n-sided polygons, namely polygon.

In [34]:
class Polygon:
    def __init__(self, *points):
        self.points = points  # vertices in order, each a Point

    def perimeter(self):
        if len(self.points) <= 2:
            return 0  # a polygon must have at least 3 sides to be called a polygon

        total = 0
        for i in range(len(self.points)):
            p1 = self.points[i]
            p2 = self.points[(i + 1) % len(self.points)]  # wraps around to the first point for the last edge
            length = self.distance(p1, p2)  # length of the edge from p1 to p2
            total += length  # summing up the edge lengths

        return total  # returning the perimeter value

    def distance(self, point1, point2):
        # applying the Pythagorean theorem to calculate the distance between 2 points
        return math.sqrt((point2.x - point1.x)**2 + (point2.y - point1.y)**2)


## Defining a function to read coordinates of polygon from user

In [35]:
def create():
    sides = int(input("Enter the number of sides for the polygon: "))
    if sides <= 2:
        # Warning only: input still continues, and perimeter() returns 0 for fewer than 3 points.
        print("a polygon must have at least three sides to be considered a polygon.")
    points=[]
    for i in range(sides): # loop for reading coordinates of each point from user
        x=float(input(f"X-coordinate for point {i+1}:"))
        y=float(input(f"Y-coordinate for point {i+1}:"))
        points.append(Point(x,y)) #appending the coordinates
    return Polygon(*points)

## Creating a loop to reuse the perimeter function and exit when required

In [36]:
while True:
    print("\n1. Create Polygon")
    print("2. Exit")
    choice=input("Enter your choice (1 or 2): ")

    if choice=='1':
        polygon=create()# creating the polygon
        perimeter=polygon.perimeter()# calculating the perimeter for the polygon
        print(f"\n The perimeter of the polygon is: {perimeter}")
    elif choice=='2':
        print("Exiting.....")
        break
    else:
        print("Invalid choice. Please enter 1 or 2.")


1. Create Polygon
2. Exit
Enter your choice (1 or 2): 1
Enter the number of sides for the polygon: 2
a polygon must have at least three sides to be considered a polygon.
X-coordinate for point 1:1
Y-coordinate for point 1:1
X-coordinate for point 2:1
Y-coordinate for point 2:1

 The perimeter of the polygon is: 0

1. Create Polygon
2. Exit
Enter your choice (1 or 2): 1
Enter the number of sides for the polygon: 3
X-coordinate for point 1:1
Y-coordinate for point 1:1
X-coordinate for point 2:1
Y-coordinate for point 2:2
X-coordinate for point 3:2
Y-coordinate for point 3:2

 The perimeter of the polygon is: 3.414213562373095

1. Create Polygon
2. Exit
Enter your choice (1 or 2): 1
Enter the number of sides for the polygon: 4
X-coordinate for point 1:2
Y-coordinate for point 1:1
X-coordinate for point 2:2
Y-coordinate for point 2:3
X-coordinate for point 3:6
Y-coordinate for point 3:3
X-coordinate for point 4:4
Y-coordinate for point 4:1

 The perimeter of the polygon is: 10.82842712474

## Code Analysis
This question had 3 main tasks:
* Implementing classes for points and for polygons.
* The polygon class should work for polygons with any number of edges.
* The polygon class should have a function perimeter that returns its perimeter.


As per the comments in the code above:

* First, a Point class is defined to hold the x and y coordinates of a point.
* Then, a Polygon class is defined, with methods for computing the distance between two points and the total perimeter.
  * The Pythagorean theorem (square root of (X1-X2)^2 + (Y1-Y2)^2) gives the distance between two points.
* A function, create(), reads the coordinates from the user: it prompts for the number of sides and then for the x and y coordinates of each point.
* A loop lets the user compute perimeters repeatedly and exit when done.
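
The assignment also asks to print the two example polygons. As a minimal sketch, they can be checked without typing coordinates by hand. The classes are repeated here so the cell runs on its own, and `math.hypot` is used as an assumed equivalent of the square-root formula above; the names `triangle` and `quad` are illustrative only.

```python
import math


class Point:
    """A point in the plane."""

    def __init__(self, x, y):
        self.x = x
        self.y = y


class Polygon:
    """A polygon given by its vertices in order."""

    def __init__(self, *points):
        self.points = points

    def perimeter(self):
        n = len(self.points)
        if n <= 2:
            return 0  # same convention as the class above
        # Pair each vertex with the next one; the modulo closes the last edge.
        return sum(
            math.hypot(self.points[(i + 1) % n].x - self.points[i].x,
                       self.points[(i + 1) % n].y - self.points[i].y)
            for i in range(n)
        )


triangle = Polygon(Point(1, 1), Point(1, 2), Point(2, 2))
quad = Polygon(Point(2, 1), Point(2, 3), Point(6, 3), Point(4, 1))

print(round(triangle.perimeter(), 1))  # 3.4
print(round(quad.perimeter(), 1))      # 10.8
```

The triangle has edges of length 1, 1 and √2, and the 4-sided polygon has edges of length 2, 4, 2√2 and 2, which agrees with the interactive runs shown above.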


## Analysis for polygons with two or fewer points
If only two points are provided, as in the first run above, they define at most a single line segment rather than a closed figure. A polygon is a closed figure with straight sides, formed by connecting at least three non-collinear points (vertices) with straight line segments, so any polygon must have at least three edges. With fewer than three points the input is not a polygon and has no meaningful perimeter: create() prints a warning in that case, and perimeter() returns 0 instead of a length.

## Q2(50 point):

* Use Pandas to load both data/AIS/transit_segments.csv, and data/AIS/vessel_information.csv. Show the first 5 rows of each dataset to inspect it.(10points)
* For data/AIS/vessel_information.csv, keep only those rows with the type value occurring for at least 100 times in the dataset. (10points)
* Merge data/AIS/vessel_information.csv and data/AIS/transit_segments.csv on the "mmsi" column using outer join. (10points)
* If you are not allowed to call the inner join provided by Pandas but have the above outer join results, how to get the results of inner join? You can use other functions provided by Pandas (but not a function that directly implements the inner join). (10points)
* Now directly call the inner join provided by Pandas, check whether your results above are exactly the same.(10points)

## Use Pandas to load both data/AIS/transit_segments.csv and data/AIS/vessel_information.csv.
* Show the first 5 rows of each dataset to inspect it.

In [18]:
#Show the first 5 rows of each dataset to inspect it.

vessel=pd.read_csv("vessel_information.csv")
vessel.head()

Unnamed: 0,mmsi,num_names,names,sov,flag,flag_type,num_loas,loa,max_loa,num_types,type
0,1,8,Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho...,Y,Unknown,Unknown,7,42.0/48.0/57.0/90.0/138.0/154.0/156.0,156.0,4,Dredging/MilOps/Reserved/Towing
1,9,3,000000009/Raven/Shearwater,N,Unknown,Unknown,2,50.0/62.0,62.0,2,Pleasure/Tug
2,21,1,Us Gov Vessel,Y,Unknown,Unknown,1,208.0,208.0,1,Unknown
3,74,2,Mcfaul/Sarah Bell,N,Unknown,Unknown,1,155.0,155.0,1,Unknown
4,103,3,Ron G/Us Navy Warship 103/Us Warship 103,Y,Unknown,Unknown,2,26.0/155.0,155.0,2,Tanker/Unknown


In [31]:
vessel.shape

(10771, 11)

* The dimensions of vessel_information.csv are (10771, 11): 10771 rows and 11 columns.

In [19]:
# Show the first 5 rows of each dataset to inspect it.

transit=pd.read_csv("transit_segments.csv")
transit.head()

Unnamed: 0,mmsi,name,transit,segment,seg_length,avg_sog,min_sog,max_sog,pdgt10,st_time,end_time
0,1,Us Govt Ves,1,1,5.1,13.2,9.2,14.5,96.5,2/10/09 16:03,2/10/09 16:27
1,1,Dredge Capt Frank,1,1,13.5,18.6,10.4,20.6,100.0,4/6/09 14:31,4/6/09 15:20
2,1,Us Gov Vessel,1,1,4.3,16.2,10.3,20.5,100.0,4/6/09 14:36,4/6/09 14:55
3,1,Us Gov Vessel,2,1,9.2,15.4,14.5,16.1,100.0,4/10/09 17:58,4/10/09 18:34
4,1,Dredge Capt Frank,2,1,9.2,15.4,14.6,16.2,100.0,4/10/09 17:59,4/10/09 18:35


In [30]:
transit.shape

(262526, 11)

* The dimensions of transit_segments.csv are (262526, 11): 262526 rows and 11 columns.

## Keep only those rows with the type value occurring at least 100 times in the dataset.

In [37]:
df=pd.read_csv('vessel_information.csv')

types=df['type'].value_counts() #counting values
filtered=types[types>= 100].index  # filtering values that occur 100 or more times

filtered_df=df[df['type'].isin(filtered)] # storing filtered data
filtered_df

Unnamed: 0,mmsi,num_names,names,sov,flag,flag_type,num_loas,loa,max_loa,num_types,type
2,21,1,Us Gov Vessel,Y,Unknown,Unknown,1,208.0,208.0,1,Unknown
3,74,2,Mcfaul/Sarah Bell,N,Unknown,Unknown,1,155.0,155.0,1,Unknown
5,310,1,Arabella,N,Bermuda,Foreign,1,47.0,47.0,1,Unknown
6,3011,1,Charleston,N,Anguilla,Foreign,1,160.0,160.0,1,Other
7,4731,1,000004731,N,Yemen (Republic of),Foreign,1,30.0,30.0,1,Unknown
...,...,...,...,...,...,...,...,...,...,...,...
10762,866946820,1,Catherine Turecamo,N,Unknown,Unknown,2,0.0/33.0,33.0,1,Tug
10764,888888888,1,Earl Jones,N,Unknown,Unknown,1,40.0,40.0,1,Towing
10766,919191919,1,Oi,N,Unknown,Unknown,1,20.0,20.0,1,Pleasure
10768,975318642,1,Island Express,N,Unknown,Unknown,1,20.0,20.0,1,Towing


* A DataFrame named df is created by reading 'vessel_information.csv'.
* value_counts() counts the occurrences of each unique value in the 'type' column.
* The types whose count is 100 or more are selected.
* Rows whose 'type' is one of those values are kept; rows with rarer types are filtered out.
* The resulting DataFrame, filtered_df, is displayed.
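
The same filtering steps can be tried on a small made-up table, which makes the result easy to verify by eye. This is only a sketch: the data, the threshold of 2 instead of 100, and the names `toy`, `min_count`, `by_isin` and `by_transform` are assumptions for illustration. It also shows an alternative using `groupby(...).transform("size")`, which attaches each row's group size directly.

```python
import pandas as pd

# Toy data (hypothetical), with a threshold of 2 instead of 100.
toy = pd.DataFrame({
    "mmsi": [1, 2, 3, 4, 5, 6],
    "type": ["Tug", "Tug", "Sailing", "Tug", "Cargo", "Cargo"],
})
min_count = 2

# Option 1: value_counts + isin, as in the cell above.
counts = toy["type"].value_counts()
keep = counts[counts >= min_count].index
by_isin = toy[toy["type"].isin(keep)]

# Option 2: give every row the size of its 'type' group, then filter on it.
group_size = toy.groupby("type")["type"].transform("size")
by_transform = toy[group_size >= min_count]

print(by_isin["type"].tolist())      # ['Tug', 'Tug', 'Tug', 'Cargo', 'Cargo']
print(by_isin.equals(by_transform))  # True
```

"Sailing" occurs only once, so its row is dropped by both options.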

## Merge data/AIS/vessel_information.csv and data/AIS/transit_segments.csv on the "mmsi" column using outer join.

In [38]:
# Merge data/AIS/vessel_information.csv and data/AIS/transit_segments.csv on the "mmsi" column using outer join.

merg=pd.merge(vessel,transit,on='mmsi',how='outer') # outer join
merg

Unnamed: 0,mmsi,num_names,names,sov,flag,flag_type,num_loas,loa,max_loa,num_types,...,name,transit,segment,seg_length,avg_sog,min_sog,max_sog,pdgt10,st_time,end_time
0,1,8.0,Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho...,Y,Unknown,Unknown,7.0,42.0/48.0/57.0/90.0/138.0/154.0/156.0,156.0,4.0,...,Us Govt Ves,1,1,5.1,13.2,9.2,14.5,96.5,2/10/09 16:03,2/10/09 16:27
1,1,8.0,Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho...,Y,Unknown,Unknown,7.0,42.0/48.0/57.0/90.0/138.0/154.0/156.0,156.0,4.0,...,Dredge Capt Frank,1,1,13.5,18.6,10.4,20.6,100.0,4/6/09 14:31,4/6/09 15:20
2,1,8.0,Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho...,Y,Unknown,Unknown,7.0,42.0/48.0/57.0/90.0/138.0/154.0/156.0,156.0,4.0,...,Us Gov Vessel,1,1,4.3,16.2,10.3,20.5,100.0,4/6/09 14:36,4/6/09 14:55
3,1,8.0,Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho...,Y,Unknown,Unknown,7.0,42.0/48.0/57.0/90.0/138.0/154.0/156.0,156.0,4.0,...,Us Gov Vessel,2,1,9.2,15.4,14.5,16.1,100.0,4/10/09 17:58,4/10/09 18:34
4,1,8.0,Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho...,Y,Unknown,Unknown,7.0,42.0/48.0/57.0/90.0/138.0/154.0/156.0,156.0,4.0,...,Dredge Capt Frank,2,1,9.2,15.4,14.6,16.2,100.0,4/10/09 17:59,4/10/09 18:35
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
262521,666909000,,,,,,,,,,...,Cg213,1,1,69.7,8.9,0.1,16.9,76.4,11/3/08 12:28,11/3/08 22:02
262522,666909000,,,,,,,,,,...,Cg204,1,1,37.4,5.3,0.0,11.5,45.2,11/8/08 15:38,11/8/08 22:51
262523,666909000,,,,,,,,,,...,Cg204,2,1,20.8,10.7,0.0,15.5,76.9,11/9/08 14:14,11/9/08 16:11
262524,666909000,,,,,,,,,,...,Cg204,3,1,49.4,9.3,0.0,15.2,60.1,11/10/08 19:48,11/11/08 1:06


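* An outer join of the two datasets on the common column 'mmsi' is performed with pd.merge.
* The resulting DataFrame has dimensions of (262526 rows × 21 columns). Rows whose mmsi appears only in transit_segments.csv have NaN in the vessel columns, which is why integer columns such as num_names are displayed as floats (8.0).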

## Using outer join results, how to get the results of inner join? Use other functions provided by Pandas.

In [41]:


df_inner=merg[~(merg['names'].isna())] # keep rows where the vessel-side column 'names' is not null
df_inner


Unnamed: 0,mmsi,num_names,names,sov,flag,flag_type,num_loas,loa,max_loa,num_types,...,name,transit,segment,seg_length,avg_sog,min_sog,max_sog,pdgt10,st_time,end_time
0,1,8.0,Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho...,Y,Unknown,Unknown,7.0,42.0/48.0/57.0/90.0/138.0/154.0/156.0,156.0,4.0,...,Us Govt Ves,1,1,5.1,13.2,9.2,14.5,96.5,2/10/09 16:03,2/10/09 16:27
1,1,8.0,Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho...,Y,Unknown,Unknown,7.0,42.0/48.0/57.0/90.0/138.0/154.0/156.0,156.0,4.0,...,Dredge Capt Frank,1,1,13.5,18.6,10.4,20.6,100.0,4/6/09 14:31,4/6/09 15:20
2,1,8.0,Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho...,Y,Unknown,Unknown,7.0,42.0/48.0/57.0/90.0/138.0/154.0/156.0,156.0,4.0,...,Us Gov Vessel,1,1,4.3,16.2,10.3,20.5,100.0,4/6/09 14:36,4/6/09 14:55
3,1,8.0,Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho...,Y,Unknown,Unknown,7.0,42.0/48.0/57.0/90.0/138.0/154.0/156.0,156.0,4.0,...,Us Gov Vessel,2,1,9.2,15.4,14.5,16.1,100.0,4/10/09 17:58,4/10/09 18:34
4,1,8.0,Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho...,Y,Unknown,Unknown,7.0,42.0/48.0/57.0/90.0/138.0/154.0/156.0,156.0,4.0,...,Dredge Capt Frank,2,1,9.2,15.4,14.6,16.2,100.0,4/10/09 17:59,4/10/09 18:35
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
262348,999999999,1.0,Triple Attraction,N,Unknown,Unknown,1.0,30.0,30.0,1.0,...,Triple Attraction,3,1,5.3,20.0,19.6,20.4,100.0,6/15/10 12:49,6/15/10 13:05
262349,999999999,1.0,Triple Attraction,N,Unknown,Unknown,1.0,30.0,30.0,1.0,...,Triple Attraction,4,1,18.7,19.2,18.4,19.9,100.0,6/15/10 21:32,6/15/10 22:29
262350,999999999,1.0,Triple Attraction,N,Unknown,Unknown,1.0,30.0,30.0,1.0,...,Triple Attraction,6,1,17.4,17.0,14.7,18.4,100.0,6/17/10 19:16,6/17/10 20:17
262351,999999999,1.0,Triple Attraction,N,Unknown,Unknown,1.0,30.0,30.0,1.0,...,Triple Attraction,7,1,31.5,14.2,13.4,15.1,100.0,6/18/10 2:52,6/18/10 5:03


* The goal is to obtain the inner join without passing how='inner' to pd.merge.
* The outer join result on 'mmsi' (merg) is the starting point.
* Rows whose vessel-side column 'names' is null are removed using isna(); these are the rows whose mmsi appears only in transit_segments.csv.
* In general, rows whose mmsi appears only in vessel_information.csv would also have to be removed by checking a transit-side column. Here the outer join has 262526 rows, the same as transit_segments.csv, so no such rows exist and the single filter is enough.
* The remaining rows are those with a match in both datasets, which is the outcome of an inner join.
* The resulting DataFrame has dimensions of (262353 rows × 21 columns).
   * This can be checked by performing the inner join directly and comparing the dimensions of the two DataFrames.
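
A more general way to recover the inner join from an outer join is the `indicator=True` option of `pd.merge`, which records for every row whether its key was found in the left table, the right table, or both. The sketch below uses small made-up tables; `vessel_toy`, `transit_toy`, `inner_from_outer` and `names_only` are hypothetical names, and this is one possible approach rather than the method used in the cell above.

```python
import pandas as pd

# Toy stand-ins (hypothetical) for the vessel and transit tables.
vessel_toy = pd.DataFrame({"mmsi": [1, 2, 3], "names": ["A", "B", "C"]})
transit_toy = pd.DataFrame({"mmsi": [1, 1, 3, 4],
                            "seg_length": [5.1, 13.5, 4.3, 9.2]})

# indicator=True adds a "_merge" column: "both", "left_only" or "right_only".
outer = pd.merge(vessel_toy, transit_toy, on="mmsi", how="outer", indicator=True)

# Rows whose key is present in both inputs are exactly the inner-join rows.
inner_from_outer = (
    outer[outer["_merge"] == "both"]
    .drop(columns="_merge")
    .reset_index(drop=True)
)

# Filtering on a vessel-side column alone keeps mmsi 2, which has no transit.
names_only = outer[outer["names"].notna()]

print(inner_from_outer["mmsi"].tolist())  # [1, 1, 3]
print(len(names_only))                    # 4
```

In the toy data, mmsi 2 exists only in the vessel table, so the single-column filter returns one row too many, while the `_merge` filter does not.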

## Call the inner join provided by Pandas to check whether the results above are exactly the same.

In [40]:
# Now directly call the inner join provided by Pandas, check whether your results above are exactly the same.

merg2=pd.merge(vessel, transit, on='mmsi',how='inner') # inner join
merg2

Unnamed: 0,mmsi,num_names,names,sov,flag,flag_type,num_loas,loa,max_loa,num_types,...,name,transit,segment,seg_length,avg_sog,min_sog,max_sog,pdgt10,st_time,end_time
0,1,8,Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho...,Y,Unknown,Unknown,7,42.0/48.0/57.0/90.0/138.0/154.0/156.0,156.0,4,...,Us Govt Ves,1,1,5.1,13.2,9.2,14.5,96.5,2/10/09 16:03,2/10/09 16:27
1,1,8,Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho...,Y,Unknown,Unknown,7,42.0/48.0/57.0/90.0/138.0/154.0/156.0,156.0,4,...,Dredge Capt Frank,1,1,13.5,18.6,10.4,20.6,100.0,4/6/09 14:31,4/6/09 15:20
2,1,8,Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho...,Y,Unknown,Unknown,7,42.0/48.0/57.0/90.0/138.0/154.0/156.0,156.0,4,...,Us Gov Vessel,1,1,4.3,16.2,10.3,20.5,100.0,4/6/09 14:36,4/6/09 14:55
3,1,8,Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho...,Y,Unknown,Unknown,7,42.0/48.0/57.0/90.0/138.0/154.0/156.0,156.0,4,...,Us Gov Vessel,2,1,9.2,15.4,14.5,16.1,100.0,4/10/09 17:58,4/10/09 18:34
4,1,8,Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho...,Y,Unknown,Unknown,7,42.0/48.0/57.0/90.0/138.0/154.0/156.0,156.0,4,...,Dredge Capt Frank,2,1,9.2,15.4,14.6,16.2,100.0,4/10/09 17:59,4/10/09 18:35
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
262348,999999999,1,Triple Attraction,N,Unknown,Unknown,1,30.0,30.0,1,...,Triple Attraction,3,1,5.3,20.0,19.6,20.4,100.0,6/15/10 12:49,6/15/10 13:05
262349,999999999,1,Triple Attraction,N,Unknown,Unknown,1,30.0,30.0,1,...,Triple Attraction,4,1,18.7,19.2,18.4,19.9,100.0,6/15/10 21:32,6/15/10 22:29
262350,999999999,1,Triple Attraction,N,Unknown,Unknown,1,30.0,30.0,1,...,Triple Attraction,6,1,17.4,17.0,14.7,18.4,100.0,6/17/10 19:16,6/17/10 20:17
262351,999999999,1,Triple Attraction,N,Unknown,Unknown,1,30.0,30.0,1,...,Triple Attraction,7,1,31.5,14.2,13.4,15.1,100.0,6/18/10 2:52,6/18/10 5:03


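* The resulting DataFrame has dimensions of (262353 rows × 21 columns).
* These dimensions match those of df_inner, which is consistent with the outer-join approach producing the same rows as the inner join.
* Matching dimensions alone do not prove that the two results are exactly the same. In the displayed rows the values agree, but vessel columns such as num_names are floats in df_inner (8.0) and integers in merg2 (8), because the outer join introduced NaN values. A strict check would compare the values after aligning the dtypes.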
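
A stricter comparison than matching shapes can be sketched as follows, again on small made-up tables. The names `vessel_toy`, `transit_toy`, `derived` and `direct` are hypothetical. Both results are sorted and re-indexed so that row order does not matter, and the derived result is cast back to the dtypes of the direct inner join before the values are compared.

```python
import pandas as pd

# Toy stand-ins (hypothetical) for the vessel and transit tables.
vessel_toy = pd.DataFrame({"mmsi": [1, 2, 3], "num_names": [8, 3, 1]})
transit_toy = pd.DataFrame({"mmsi": [1, 1, 3, 4],
                            "seg_length": [5.1, 13.5, 4.3, 9.2]})

sort_cols = ["mmsi", "seg_length"]

# Inner join derived from the outer join: drop rows missing either side.
outer = pd.merge(vessel_toy, transit_toy, on="mmsi", how="outer")
derived = (
    outer.dropna(subset=["num_names", "seg_length"])
    .sort_values(sort_cols)
    .reset_index(drop=True)
)

# Inner join computed directly by pandas.
direct = (
    pd.merge(vessel_toy, transit_toy, on="mmsi", how="inner")
    .sort_values(sort_cols)
    .reset_index(drop=True)
)

print(derived.shape == direct.shape)  # True
# The outer join turned the integer column into floats because of NaN.
print(derived["num_names"].dtype, direct["num_names"].dtype)  # float64 int64

# Cast back to the direct result's dtypes, then compare every value.
derived = derived.astype(direct.dtypes.to_dict())
pd.testing.assert_frame_equal(derived, direct)  # raises if anything differs
```

If `assert_frame_equal` finishes without raising, the two DataFrames agree in values, column order, dtypes and index.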