# UFO Sightings

#### The objective of this assignment is for you to explain what is happening in each cell in clear, understandable language. 

#### _There is no need to code._ The code is there for you, and it already runs. Your task is only to explain what each line in each cell does.

#### Each placeholder cell should describe what happens in the cell below it.

**Example**: The cell below imports `pandas` as a dependency because `pandas` functions will be used throughout the program, such as the Pandas `DataFrame` as well as the `read_csv` function.

In [None]:
import pandas as pd

In [None]:
# Stores the path to the CSV file (located in the Resources folder) in the variable csv_path.
# Reads the CSV file at that path into a dataframe named ufo_df.
# Displays the first 5 rows of the dataframe.

csv_path = "Resources/ufoSightings.csv"

ufo_df = pd.read_csv(csv_path)

ufo_df.head()

In [None]:
# Counts the non-NA entries in each column of the dataframe and displays the result as a table.
# This gives a sense of how many rows of data exist and whether each column is populated to the same extent.

ufo_df.count()

In [None]:
# Cleans up the dataframe by removing any row that contains at least one NA value (how="any").
# If how="all" had been used instead, only rows (or, with axis=1, columns) in which every value is NA
# would have been removed. That is useful because such rows/columns carry no information.
# The disadvantage of how="any" is that it discards an entire row even when only one value is missing,
# so some of the available information is lost. For example, if a UFO sighting included every field
# except "state" (which was NA), the state could have been determined from the latitude/longitude,
# and the row would still have been a useful data point. On the other hand, removing rows with NAs
# does make the data easier to analyze.

clean_ufo_df = ufo_df.dropna(how="any")

# Counts the non-NA entries in each column of the cleaned dataframe and displays the result as a table.
# Note that all columns now have the same number of entries (every remaining row is fully populated with non-NA values).

clean_ufo_df.count()
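
The difference between `how="any"` and `how="all"` described above can be seen on a small example. This is a minimal sketch on a hypothetical toy dataframe (`toy_df` and its values are made up for illustration and are not taken from the UFO file).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the UFO file
toy_df = pd.DataFrame({
    "city": ["austin", "reno", np.nan],
    "state": ["tx", np.nan, np.nan],
})

# how="any": a row is dropped if at least one of its values is NA
any_dropped = toy_df.dropna(how="any")

# how="all": a row is dropped only if every one of its values is NA
all_dropped = toy_df.dropna(how="all")

print(len(any_dropped))  # only the fully populated "austin" row remains
print(len(all_dropped))  # the "reno" row survives because its city is known
```

In this sketch the "reno" row is the kind of partially populated record that `how="any"` discards and `how="all"` keeps.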

In [None]:
# Creates a list with a subset of the columns in the original data frame and names the list "columns"

columns = [
    "datetime",
    "city",
    "state",
    "country",
    "shape",
    "duration (seconds)",
    "duration (hours/min)",
    "comments",
    "date posted"
]

# Creates a new dataframe called usa_ufo_df.
# The new dataframe starts from the cleaned dataframe created in the previous cell, looks at the column called "country",
# and keeps all rows whose country is "us" (lowercase, as stored in the data). It includes only the columns listed in the
# variable "columns" (i.e., unlike the previous dataframes, it does not include the latitude and longitude columns).

usa_ufo_df = clean_ufo_df.loc[clean_ufo_df["country"] == "us", columns]

# Prints the first 5 rows of this new dataframe as output
usa_ufo_df.head()
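
The row-and-column selection with `.loc` can be sketched on a hypothetical toy dataframe (the names `toy_df`, `keep_columns`, and `is_us` are illustrative assumptions, not part of the notebook). The sketch also shows that string comparison is case-sensitive, so the value must match the data exactly.

```python
import pandas as pd

# Hypothetical toy data, not taken from the UFO file
toy_df = pd.DataFrame({
    "city": ["austin", "toronto", "reno"],
    "country": ["us", "ca", "us"],
    "latitude": [30.27, 43.65, 39.53],
})

keep_columns = ["city", "country"]

# Boolean mask: True for rows whose country is exactly "us"
is_us = toy_df["country"] == "us"

# .loc[rows, columns] keeps the masked rows and only the listed columns
us_only = toy_df.loc[is_us, keep_columns]

# The comparison is case-sensitive: "US" matches nothing in this data
upper_match = toy_df.loc[toy_df["country"] == "US", keep_columns]

print(us_only)
print(len(upper_match))
```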

In [None]:
# Defines state_counts as the result of taking the "state" column of the usa_ufo_df dataframe and
# counting how many times each unique value (each state) appears in that column.
# This provides a quick summary view that is especially useful because value_counts() also sorts the output in
# descending order, so one can see at a glance how the data is distributed
# (e.g., which state has the most sightings and which has the fewest).

state_counts = usa_ufo_df["state"].value_counts()

# Shows state_counts as output

state_counts
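
As a minimal sketch of what `value_counts()` returns, here is the same call on a hypothetical toy Series (`states` and its values are made up for illustration).

```python
import pandas as pd

# Hypothetical toy data, not taken from the UFO file
states = pd.Series(["tx", "ca", "tx", "nv", "tx", "ca"])

# Counts occurrences of each unique value, most frequent first
counts = states.value_counts()

print(counts)
```

The unique values become the index of the result and the counts become its values, which is why the most frequent state appears on the first line.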

In [None]:
# Converts the state_counts Series into a dataframe and displays the first 5 rows.

state_ufo_counts_df = pd.DataFrame(state_counts)
state_ufo_counts_df.head()

In [None]:
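# Replaces the dataframe created in the previous cell with one that has the same information but a new column heading
# (.rename changes the column heading from "state" to "Sum of Sightings"). Displays the first 5 rows.
# Note: in pandas 2.0 and later, value_counts() names its result "count", so the column to rename is "count" rather than "state".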

state_ufo_counts_df = state_ufo_counts_df.rename(
    columns={"state": "Sum of Sightings"})
state_ufo_counts_df.head()
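
Because the column name produced by `value_counts()` depends on the pandas version, one possible alternative is to name the Series before converting it to a dataframe. This is a sketch on hypothetical toy data, not the notebook's own method; `toy_df` and `counts_df` are illustrative names.

```python
import pandas as pd

# Hypothetical toy data, not taken from the UFO file
toy_df = pd.DataFrame({"state": ["tx", "ca", "tx"]})

counts = toy_df["state"].value_counts()

# Naming the Series first sets the column heading directly,
# whatever name value_counts() gave the result
counts_df = counts.rename("Sum of Sightings").to_frame()

print(counts_df)
```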

In [None]:
# Shows the data type of each column of the dataframe (e.g., object, integer) as output.
# Knowing this, one can decide whether any types should be changed to make the data easier to manipulate.
usa_ufo_df.dtypes

In [None]:
# Selects the column called "duration (seconds)" and converts it to a new data type ("float"), which is numeric (with decimals)
# and, unlike the original data type (object), can be used in computations.
usa_ufo_df.loc[:, "duration (seconds)"] = usa_ufo_df["duration (seconds)"].astype("float")

# Shows each column's data types as an output
usa_ufo_df.dtypes

In [None]:
# Now it is possible to find the sum of seconds.
# Sums the values in the "duration (seconds)" column. This would not have given a meaningful numeric total
# while the values were still stored as text (data type "object").
usa_ufo_df["duration (seconds)"].sum()
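
To illustrate why the conversion matters, the sketch below sums a hypothetical toy Series of numeric strings before and after `astype("float")` (the values are made up for illustration).

```python
import pandas as pd

# Hypothetical toy data stored as text, as in an "object" column
durations = pd.Series(["10", "2.5", "30"], dtype="object")

# Summing strings joins them together instead of adding numbers
as_text = durations.sum()

# After conversion to float, the sum is arithmetic
as_float = durations.astype("float").sum()

print(repr(as_text))
print(as_float)
```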

In [None]:
# Groups the data in the usa_ufo_df dataframe first by "state" and then by "city", and assigns the result
# to an object called grouped_data.
grouped_data = usa_ufo_df.groupby(['state', 'city'])

# Hint: If you are counting records, you can use any column and get the same result. Try it.
# Selects the "datetime" column of the grouped_data object and returns the counts, sorted alphabetically
# (first by state and then by city within each state). The result shows how many "datetime" entries exist
# for each state/city pair. Because rows containing NA values were removed earlier, every column has the same
# number of entries, so one gets the same result if the column is, for instance, "shape" instead of "datetime".

grouped_data['datetime'].count()
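
The hint about counting any column can be checked on a small example. This is a minimal sketch on a hypothetical toy dataframe with no NA values (`toy_df` and its contents are made up for illustration). It also shows `size()`, which counts rows per group without choosing a column.

```python
import pandas as pd

# Hypothetical toy data with no NA values, not taken from the UFO file
toy_df = pd.DataFrame({
    "state": ["tx", "tx", "ca", "tx"],
    "city": ["austin", "austin", "fresno", "dallas"],
    "datetime": ["1/1/2000", "1/2/2000", "1/3/2000", "1/4/2000"],
    "shape": ["disk", "light", "orb", "disk"],
})

grouped = toy_df.groupby(["state", "city"])

# With no NA values, counting any column gives the same numbers
by_datetime = grouped["datetime"].count()
by_shape = grouped["shape"].count()

# size() counts rows per group regardless of column
sizes = grouped.size()

print(by_datetime)
```

If a column did contain NA values, `count()` on that column would skip them, while `size()` would still report the full number of rows in each group.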