# M1L7 Data Types, Dates, Strings 

We'll be working with UFO sighting data.

### **Dataset:** [UFO Sightings](https://www.kaggle.com/datasets/jonwright13/ufo-sightings-around-the-world-better?resource=download) (the file is also in your data folder)

### **Objectives:**

- Change an object to a datetime object 
- Use string methods to manipulate data 


### Step 1:  Import pandas and numpy 

In [20]:
# Import packages

import pandas as pd
import numpy as np


### Step 2:  Load in the data and save it as `df`

- The dataset is named `ufo-sightings.csv`

In [None]:
df = pd.read_csv('ufo-sightings.csv')

df


Unnamed: 0.1,Unnamed: 0,Date_time,date_documented,Year,Month,Hour,Season,Country_Code,Country,Region,Locale,latitude,longitude,UFO_shape,length_of_encounter_seconds,Encounter_Duration,Description
0,0,1949-10-10 20:30:00,4/27/2004,1949,10,20,Autumn,USA,United States,Texas,San Marcos,29.883056,-97.941111,Cylinder,2700.0,45 minutes,This event took place in early fall around 194...
1,1,1949-10-10 21:00:00,12/16/2005,1949,10,21,Autumn,USA,United States,Texas,Bexar County,29.384210,-98.581082,Light,7200.0,1-2 hrs,1949 Lackland AFB&#44 TX. Lights racing acros...
2,2,1955-10-10 17:00:00,1/21/2008,1955,10,17,Autumn,GBR,United Kingdom,England,Chester,53.200000,-2.916667,Circle,20.0,20 seconds,Green/Orange circular disc over Chester&#44 En...
3,3,1956-10-10 21:00:00,1/17/2004,1956,10,21,Autumn,USA,United States,Texas,Edna,28.978333,-96.645833,Circle,20.0,1/2 hour,My older brother and twin sister were leaving ...
4,4,1960-10-10 20:00:00,1/22/2004,1960,10,20,Autumn,USA,United States,Hawaii,Kaneohe,21.418056,-157.803611,Light,900.0,15 minutes,AS a Marine 1st Lt. flying an FJ4B fighter/att...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
80323,80323,2013-09-09 21:15:00,9/30/2013,2013,9,21,Autumn,USA,United States,Tennessee,Nashville,36.165833,-86.784444,Light,600.0,10 minutes,Round from the distance/slowly changing colors...
80324,80324,2013-09-09 22:00:00,9/30/2013,2013,9,22,Autumn,USA,United States,Idaho,Boise,43.613611,-116.202500,Circle,1200.0,20 minutes,Boise&#44 ID&#44 spherical&#44 20 min&#44 10 r...
80325,80325,2013-09-09 22:00:00,9/30/2013,2013,9,22,Autumn,USA,United States,California,Napa Abajo,38.297222,-122.284444,Other,1200.0,hour,Napa UFO&#44
80326,80326,2013-09-09 22:20:00,9/30/2013,2013,9,22,Autumn,USA,United States,Virginia,Vienna,38.901111,-77.265556,Circle,5.0,5 seconds,Saw a five gold lit cicular craft moving fastl...
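The `Unnamed: 0` column in the output above is what pandas typically produces when a CSV was saved with its index and then read back without naming an index column. The sketch below illustrates this on a tiny in-memory CSV; `csv_text` is a made-up stand-in for the real file, and `index_col=0` is one possible way to handle it, not a required step in this lesson.

```python
import io

import pandas as pd

# Hypothetical miniature CSV whose first header cell is empty,
# as happens when a DataFrame index is written to disk.
csv_text = (
    ",Date_time,UFO_shape\n"
    "0,1949-10-10 20:30:00,Cylinder\n"
    "1,1949-10-10 21:00:00,Light\n"
)

# Default read: the unnamed first column becomes 'Unnamed: 0'.
default = pd.read_csv(io.StringIO(csv_text))
print(default.columns.tolist())

# Treat the first column as the index instead of as data.
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(indexed.columns.tolist())
```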


### Step 3: Check column data types and the head of the data -- does the data/types make sense?

In [25]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 80328 entries, 0 to 80327
Data columns (total 17 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   Unnamed: 0                   80328 non-null  int64  
 1   Date_time                    80328 non-null  object 
 2   date_documented              80328 non-null  object 
 3   Year                         80328 non-null  int64  
 4   Month                        80328 non-null  int64  
 5   Hour                         80328 non-null  int64  
 6   Season                       80328 non-null  object 
 7   Country_Code                 80069 non-null  object 
 8   Country                      80069 non-null  object 
 9   Region                       79762 non-null  object 
 10  Locale                       79871 non-null  object 
 11  latitude                     80328 non-null  float64
 12  longitude                    80328 non-null  float64
 13  UFO_shape       
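
Step 3 also asks for the head of the data. A minimal sketch of checking both the types and the first rows is below; `sample` is a small made-up frame standing in for `df`, so the printed values are illustrative only.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype, is_numeric_dtype

# Hypothetical two-row stand-in for the UFO data.
sample = pd.DataFrame({
    "Date_time": ["1949-10-10 20:30:00", "1955-10-10 17:00:00"],
    "Year": [1949, 1955],
    "latitude": [29.883056, 53.2],
})

print(sample.dtypes)  # one dtype per column
print(sample.head())  # first rows, to compare values against their dtypes

# Date_time looks like a date but is stored as text until it is converted.
print(is_datetime64_any_dtype(sample["Date_time"]))
```

Comparing the values in `head()` with the reported dtypes is how you spot columns, such as `Date_time`, whose contents are dates but whose type is still text.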

### Step 4:  Convert the `Date_time` column to datetime 

- Even though we already have columns for year, month, and hour, we still want `Date_time` stored as a datetime object.
- Dates can come in many formats. This column uses `'%Y-%m-%d %H:%M:%S'` (for example, `1949-10-10 20:30:00`), so we pass that format explicitly.

In [29]:
df['Date_time'] = pd.to_datetime(df['Date_time'],format='%Y-%m-%d %H:%M:%S')
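
As a sketch of what the format string does, the example below converts two differently formatted date columns and then pulls parts out with the `.dt` accessor. The `sample` frame is made up, the `"not a date"` row is a hypothetical bad value, and `errors="coerce"` is an optional choice (not used in the cell above) that turns unparseable values into `NaT` instead of raising an error.

```python
import pandas as pd

# Hypothetical stand-in rows; the last Date_time value is deliberately invalid.
sample = pd.DataFrame({
    "Date_time": ["1949-10-10 20:30:00", "2013-09-09 22:20:00", "not a date"],
    "date_documented": ["4/27/2004", "9/30/2013", "1/21/2008"],
})

# Year-month-day with a time component; invalid strings become NaT.
sample["Date_time"] = pd.to_datetime(
    sample["Date_time"], format="%Y-%m-%d %H:%M:%S", errors="coerce"
)

# Month/day/year needs a different format string.
sample["date_documented"] = pd.to_datetime(
    sample["date_documented"], format="%m/%d/%Y"
)

# Once the column is datetime, its components are available through .dt
print(sample["Date_time"].dt.year)
print(sample["date_documented"].dt.month)
```

Without `errors="coerce"`, a value that does not match the format raises an error, which is often the safer default while you are still exploring the data.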


In [30]:
# Run this to confirm the conversion worked
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 80328 entries, 0 to 80327
Data columns (total 17 columns):
 #   Column                       Non-Null Count  Dtype         
---  ------                       --------------  -----         
 0   Unnamed: 0                   80328 non-null  int64         
 1   Date_time                    80328 non-null  datetime64[ns]
 2   date_documented              80328 non-null  object        
 3   Year                         80328 non-null  int64         
 4   Month                        80328 non-null  int64         
 5   Hour                         80328 non-null  int64         
 6   Season                       80328 non-null  object        
 7   Country_Code                 80069 non-null  object        
 8   Country                      80069 non-null  object        
 9   Region                       79762 non-null  object        
 10  Locale                       79871 non-null  object        
 11  latitude                     80328 non-nu

### Step 5:  Make the `Description` column all lowercase 

- Think about why we would want the text in all lowercase 

**Instructor Notes**
Feel free to discuss text analytics or LLMs, or use a simple case such as state names that appear in different cases and need to be aggregated.

In [31]:
df['Description'] = df['Description'].str.lower()
print(df['Description'])

0        this event took place in early fall around 194...
1        1949 lackland afb&#44 tx.  lights racing acros...
2        green/orange circular disc over chester&#44 en...
3        my older brother and twin sister were leaving ...
4        as a marine 1st lt. flying an fj4b fighter/att...
                               ...                        
80323    round from the distance/slowly changing colors...
80324    boise&#44 id&#44 spherical&#44 20 min&#44 10 r...
80325                                         napa ufo&#44
80326    saw a five gold lit cicular craft moving fastl...
80327    2 witnesses 2  miles apart&#44 red &amp; white...
Name: Description, Length: 80328, dtype: object
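
One way to see why lowercasing matters is the aggregation case: the same value written in different cases is counted as separate groups. The sketch below uses a made-up `sample` frame with a hypothetical `sightings` column; it is an illustration, not part of the UFO data.

```python
import pandas as pd

# Hypothetical data: 'Texas' appears in three different casings.
sample = pd.DataFrame({
    "Region": ["Texas", "texas", "TEXAS", "Hawaii"],
    "sightings": [1, 1, 1, 1],
})

# Before normalizing, each casing forms its own group.
raw_counts = sample.groupby("Region")["sightings"].sum()
print(raw_counts)

# After lowercasing, the three Texas rows collapse into one group.
sample["Region"] = sample["Region"].str.lower()
clean_counts = sample.groupby("Region")["sightings"].sum()
print(clean_counts)
```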


### Step 6:  Replace spaces with underscores in the `Encounter_Duration` column


In [33]:
df['Encounter_Duration'] = df['Encounter_Duration'].str.replace(' ', '_')
print(df['Encounter_Duration'])

0        45_minutes
1           1-2_hrs
2        20_seconds
3          1/2_hour
4        15_minutes
            ...    
80323    10_minutes
80324    20_minutes
80325          hour
80326     5_seconds
80327    17_minutes
Name: Encounter_Duration, Length: 80328, dtype: object
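
A common slip with `str.replace` is passing a description of the characters (the word `'spaces'`) rather than the characters themselves (`' '`). The sketch below shows the difference on a small made-up Series; `durations` is hypothetical, and `regex=False` is spelled out only to make the literal matching explicit.

```python
import pandas as pd

# Hypothetical sample of duration strings, including a missing value.
durations = pd.Series(["45 minutes", "1-2 hrs", "hour", None])

# Replace the literal space character with an underscore.
literal = durations.str.replace(" ", "_", regex=False)
print(literal)

# Searching for the word 'spaces' matches nothing, so the data is unchanged.
unchanged = durations.str.replace("spaces", "underscores", regex=False)
print(unchanged.equals(durations))
```

Missing values pass through `str.replace` untouched, so there is no need to fill them first.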
