# Let's Catch a UFO

1. Key things to include: Project Overview, Stakeholder & Business Questions, Data Sources, and Links to notebooks, presentations, and dashboards

2. Additional Notes Business Problem: A clear statement of the business problem you are solving.

3. Dataset Information: A description of your dataset(s) and how you plan to use it.

4. Methods: Brief explanation of your analysis steps.

5. Results and Recommendations: What insights and recommendations are you providing to the stakeholder?

6. Links: Include links to the Jupyter notebook, dashboard, and presentation files.

Remember: no code, just visuals. Keep it non-technical.

Example Template:

Project Title

Overview A concise summary of the project, including the purpose, the problem it addresses, and the key findings.
- Goal of the project: Clearly state the objective.
- Context: Brief background or motivation for the project.
- Main results/insights: Summarize key outcomes or insights from your analysis.

Repository Structure (probably should go at the bottom)

Provide an outline of the repository, explaining what each folder and file contains.

- 📁 /data: Contains raw and cleaned datasets
- 📁 /notebooks: Jupyter Notebooks or code scripts used in analysis
- 📁 /scripts: Python or other scripts for data cleaning and modeling
- 📁 /images: Graphs, figures, Tableau dashboard files
- 📄 README.md: Documentation of the project
- 📄 requirements.txt: Packages and dependencies needed to run the code
- 📄 presentation.pdf: Final presentation slides

Folder details:

- /data: A brief description of the datasets used, including sources.
- /notebooks: Notebooks detailing data exploration, cleaning, analysis, and modeling.
- /scripts: Python scripts for automating tasks like data processing.
- /images: Contains final visuals, plots, or links to Tableau dashboards.

Data Science Steps Outline the key steps taken during the project:
- Data Collection: How data was sourced (e.g., APIs, web scraping, public datasets).
- Data Cleaning: Techniques used for cleaning and preprocessing data (e.g., handling missing values).
- Exploratory Data Analysis (EDA): Summary of insights found during the EDA phase.
- Modeling: Brief overview of the models used and their performance.
- Results: Main findings from the analysis or predictive models.

Instructions for Use Guide users on how to navigate the repository, including how to replicate the project on their local machine: (git clone link)

Tableau Dashboard Include a link to the Tableau dashboard:

Tableau Dashboard

Presentation Provide a link to the final project presentation:

Sources
List any references or external data sources used: Data Source 1(Kaggle)

Commit History Provide an overview of the commit history to demonstrate project development and collaboration. Link to the repository’s commit history for detailed tracking:
View commit history



1. Stakeholder Questions:
- What regions and times have the highest frequency of UFO sightings?
- Are there notable patterns in UFO shapes, descriptions, or lengths of encounters?
- Can any correlations be drawn between the timing (season, time of day) and the likelihood of a sighting?
- (Is there potential for identifying anomalies or "false positives" in the sighting reports?)
2. Business Problem: Provide clear and engaging data visualizations that help stakeholders (researchers, enthusiasts, or possibly governmental entities) understand trends in UFO sightings, aiding decision-making for further research or public communication.

In [None]:
#Import
import numpy as np 
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors

In [None]:
df = pd.read_csv('/Users/jeesoojhun/Documents/Flatiron/Phase-1-Project/data/ufo_data/ufo-sightings-transformed.csv', index_col='Unnamed: 0')

In [None]:
df.head(5)

In [None]:
df.info()

In [None]:
df.isnull().sum()

# Data Cleaning

**Time Standardization**

In [None]:
#Standardize Dates and Times
#Convert Date_time to datetime format
df['Date_time'] = pd.to_datetime(df['Date_time'], errors='coerce')
df.head()

In [None]:
#Standardize the date_documented column
df['date_documented'] = pd.to_datetime(df['date_documented'], errors='coerce')
df.head()
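
Because `errors='coerce'` silently turns unparseable values into `NaT`, it may be worth counting how many values were lost in the conversion. The sketch below runs on a small made-up series rather than the real columns, and `count_coerced` is a hypothetical helper, not something from this notebook.

```python
import pandas as pd

def count_coerced(raw: pd.Series, parsed: pd.Series) -> int:
    """Count values that were present before parsing but became NaT afterwards."""
    return int((raw.notna() & parsed.isna()).sum())

# Toy data standing in for the real Date_time column (assumption)
raw_dates = pd.Series(['2010-10-10 16:00:00', 'not a date', None, '2011-01-05 21:30:00'])
parsed_dates = pd.to_datetime(raw_dates, errors='coerce')

n_lost = count_coerced(raw_dates, parsed_dates)
print(f"{n_lost} value(s) could not be parsed")
```

To use this on the real data, keep a copy of the raw column before overwriting it with the parsed version.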

In [None]:
#Derive Year, Month, Hour and Season columns
#Year, Month, and Hour: recompute from Date_time so they always agree with it
#Season: group months into seasons (Northern Hemisphere convention)
df['Year'] = df['Date_time'].dt.year
df['Month'] = df['Date_time'].dt.month
df['Hour'] = df['Date_time'].dt.hour

#Seasons
#Map month -> season; rows whose Date_time failed to parse keep NaN instead of defaulting to 'Autumn'
season_by_month = {12: 'Winter', 1: 'Winter', 2: 'Winter',
                   3: 'Spring', 4: 'Spring', 5: 'Spring',
                   6: 'Summer', 7: 'Summer', 8: 'Summer',
                   9: 'Autumn', 10: 'Autumn', 11: 'Autumn'}
df['Season'] = df['Month'].map(season_by_month)

df.head()

In [None]:
#Standardize the length of encounter seconds column
# Convert 'length_of_encounter_seconds' to numeric, forcing errors to NaN
df['length_of_encounter_seconds'] = pd.to_numeric(df['length_of_encounter_seconds'], errors='coerce')

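# Remove outliers (e.g., encounters longer than a day)
# Note: rows where the duration is NaN also fail this comparison and are dropped
# .copy() avoids SettingWithCopyWarning in later column assignments
df = df[df['length_of_encounter_seconds'] <= 86400].copy()  # 86400 seconds = 24 hours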

# Verify changes
df['length_of_encounter_seconds'].describe()
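
One thing worth checking: the comparison `<= 86400` is False for NaN, so the filter removes rows with a missing duration as well as true outliers. A minimal sketch on made-up values separates the two cases; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy durations in seconds (assumption): one missing value and one longer than a day
durations = pd.Series([30.0, 600.0, np.nan, 90000.0, 86400.0])

max_seconds = 86400  # 24 hours, the same cutoff used in the cleaning step
is_missing = durations.isna()
is_too_long = durations > max_seconds

# Same style of filter as in the notebook: NaN rows fail the comparison too
kept = durations[durations <= max_seconds]
print(f"missing: {is_missing.sum()}, too long: {is_too_long.sum()}, kept: {len(kept)}")
```

Running the same counts on the real column before filtering would show how many rows are lost for each reason.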

In [None]:
# Display unique values to understand various formats in 'Encounter_Duration'
df['Encounter_Duration'].unique()[:20]  # Display first 20 unique values to analyze different formats

In [None]:
#Unify the units into seconds and convert to numeric in Encounter_Duration column
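
A possible starting point for this step is sketched below. It assumes the values look like `'<number> <unit>'` (for example `'5 minutes'`); the real formats in `Encounter_Duration` may differ, so the unit spellings, the regular expression, and the helper name `duration_to_seconds` are all assumptions to adjust after inspecting the unique values.

```python
import re
import numpy as np
import pandas as pd

# Seconds per unit; the accepted spellings are assumptions about the data
UNIT_SECONDS = {
    'sec': 1, 'secs': 1, 'second': 1, 'seconds': 1,
    'min': 60, 'mins': 60, 'minute': 60, 'minutes': 60,
    'hr': 3600, 'hrs': 3600, 'hour': 3600, 'hours': 3600,
}

# A number (optionally with decimals) followed by a lowercase word
DURATION_PATTERN = re.compile(r'(\d+(?:\.\d+)?)\s*([a-z]+)')

def duration_to_seconds(text):
    """Return the duration in seconds, or NaN if no '<number> <unit>' pair is recognized."""
    if not isinstance(text, str):
        return np.nan
    match = DURATION_PATTERN.search(text.lower())
    if match is None or match.group(2) not in UNIT_SECONDS:
        return np.nan
    return float(match.group(1)) * UNIT_SECONDS[match.group(2)]

# Made-up sample values, not taken from the dataset
sample = pd.Series(['5 minutes', '30 Seconds', '2 hours', '1.5 min', 'a while', None])
seconds = sample.apply(duration_to_seconds)
print(seconds.tolist())
```

Only the first number and unit pair is read, so a value such as `'1 hour 30 minutes'` would need extra handling.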

**Handling Geographical Data**

In [None]:
#Standardize Country Codes and Names.
#Ensure that the country codes and Country values are consistent.
df['Country_Code'] = df['Country_Code'].str.upper().str.strip()
df['Country'] = df['Country'].str.title().str.strip()

df['Country_Code'].unique()

In [None]:
# Ensure latitude and longitude are within valid ranges
df = df[(df['latitude'] >= -90) & (df['latitude'] <= 90)]
df = df[(df['longitude'] >= -180) & (df['longitude'] <= 180)]

# Verify the changes
df[['latitude', 'longitude']].describe()

In [None]:
#Fill in missing Country values from latitude and longitude using geopandas or geopy
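
One way to approach the geopy idea noted above is reverse geocoding: look up the country for each row whose `Country` is missing, using its coordinates. The sketch below is only an illustration. `fill_missing_country` and `nominatim_lookup` are hypothetical helpers, the `user_agent` string is a made-up application name, and the demonstration uses a stand-in lookup so it runs without network access.

```python
import numpy as np
import pandas as pd

def fill_missing_country(frame, lookup_country):
    """Return a copy of frame with NaN 'Country' values filled via lookup_country(lat, lon)."""
    result = frame.copy()
    missing = result['Country'].isna()
    for idx in result.index[missing]:
        result.at[idx, 'Country'] = lookup_country(
            result.at[idx, 'latitude'], result.at[idx, 'longitude']
        )
    return result

def nominatim_lookup(latitude, longitude):
    """Reverse-geocode one point with geopy's Nominatim (needs geopy and network access)."""
    from geopy.geocoders import Nominatim  # imported lazily so the demo runs without geopy
    geolocator = Nominatim(user_agent='ufo-sightings-cleaning')  # hypothetical app name
    location = geolocator.reverse((latitude, longitude), language='en')
    if location is None:
        return np.nan
    return location.raw.get('address', {}).get('country', np.nan)

# Offline demonstration with a stand-in lookup and made-up rows
toy = pd.DataFrame({
    'Country': ['United States', np.nan],
    'latitude': [29.88, 51.51],
    'longitude': [-97.94, -0.13],
})
filled = fill_missing_country(toy, lambda lat, lon: 'United Kingdom')
print(filled['Country'].tolist())
```

With the real service, `fill_missing_country(df, nominatim_lookup)` would issue one request per missing row. Nominatim's usage policy limits request rates, so for a dataset with many gaps this may be too slow to be practical.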

In [None]:
%pip install geopy

In [None]:
#Decided to simply drop the rows with missing location values
# Drop rows with missing values in 'Country_Code', 'Country', 'Region', or 'Locale'
# ('UFO_shape' is handled separately below by filling with 'Unknown')
df = df.dropna(subset=['Country_Code', 'Country', 'Region', 'Locale'])
print(df)

In [None]:
# Verify that no missing values remain in the location columns ('UFO_shape' may still show missing values here)
print(df[['Country_Code', 'Country', 'Region', 'Locale', 'UFO_shape']].isnull().sum())

In [None]:
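# Fill missing UFO shapes with 'Unknown' (assign back rather than using inplace on a column selection)
df['UFO_shape'] = df['UFO_shape'].fillna('Unknown')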

# Questions