# Training a model to predict shape based on time and space
$Mahsa Bakhtiari$

I queried the database for dates that had more than 10 reports and were reported from more than two countries, then trained a random forest classifier (an ensemble of decision trees) to see whether there is a distinct pattern between the spacetime of the sightings and the reported shape.

In [1]:
from datetime import datetime
import pandas as pd
import sklearn

In [2]:
data = pd.read_csv('./Resources/data.csv', index_col='Unnamed: 0')

In [3]:
data.head(2)

Unnamed: 0,timestamp,duration (seconds),city,state,country,latitude,longitude,shape,comments
0,1949-10-10 20:30:00,2700,san marcos,tx,us,29.883056,-97.941111,cylinder,This event took place in early fall around 194...
3,1956-10-10 21:00:00,20,edna,tx,us,28.978333,-96.645833,circle,My older brother and twin sister were leaving ...


In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 66516 entries, 0 to 80331
Data columns (total 9 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   timestamp           66516 non-null  object 
 1   duration (seconds)  66516 non-null  int64  
 2   city                66516 non-null  object 
 3   state               66516 non-null  object 
 4   country             66516 non-null  object 
 5   latitude            66516 non-null  float64
 6   longitude           66516 non-null  float64
 7   shape               66516 non-null  object 
 8   comments            66516 non-null  object 
dtypes: float64(2), int64(1), object(6)
memory usage: 5.1+ MB


In [5]:
data["timestamp"] = pd.to_datetime(data["timestamp"])

In [6]:
data["date"] = data["timestamp"].map(lambda x: x.date())

In [7]:
data["time"] = data["timestamp"].map(lambda x: x.time())

In [8]:
data.head(2)

Unnamed: 0,timestamp,duration (seconds),city,state,country,latitude,longitude,shape,comments,date,time
0,1949-10-10 20:30:00,2700,san marcos,tx,us,29.883056,-97.941111,cylinder,This event took place in early fall around 194...,1949-10-10,20:30:00
3,1956-10-10 21:00:00,20,edna,tx,us,28.978333,-96.645833,circle,My older brother and twin sister were leaving ...,1956-10-10,21:00:00


In [9]:
target = data.groupby(["date"])["date"].count()

In [10]:
#target = pd.DataFrame(data=target, index=None).reset_index()
target = target.loc[lambda x: x>10]

In [11]:
dates = target.index.tolist()

In [12]:
data.loc[data.date == dates[0]]

Unnamed: 0,timestamp,duration (seconds),city,state,country,latitude,longitude,shape,comments,date,time
46934,1963-06-01 00:00:00,900,cocoa,fl,us,28.385833,-80.742222,unknown,We are not alone&#44 I swear it before God Al...,1963-06-01,00:00:00
46935,1963-06-01 02:00:00,300,st. louis county,mo,us,38.627222,-90.197778,cigar,Cigar shaped craft hovering over swimming pool...,1963-06-01,02:00:00
46936,1963-06-01 08:30:00,30,gonzales,tx,us,29.501389,-97.452222,cigar,Object was cigar shaped&#44lights along side a...,1963-06-01,08:30:00
46937,1963-06-01 11:00:00,1200,albemarle,nc,us,35.35,-80.200278,disk,outside playling &#44neighbor starting yelling...,1963-06-01,11:00:00
46938,1963-06-01 12:00:00,300,houston,tx,us,29.763056,-95.363056,cigar,Cigar shaped UFO over north Houston&#44&#44 b...,1963-06-01,12:00:00
46939,1963-06-01 13:00:00,60,kirksey,ky,us,36.698611,-88.395278,triangle,Triangle shaped UFO seen in daytime in 1963 in...,1963-06-01,13:00:00
46940,1963-06-01 16:00:00,900,waukegan,il,us,42.363611,-87.844722,formation,Group of stationary objects high in afternoon ...,1963-06-01,16:00:00
46941,1963-06-01 18:00:00,180,royal oak,mi,us,42.489444,-83.144722,triangle,A hovering triangular craft&#44 with moving wh...,1963-06-01,18:00:00
46942,1963-06-01 20:00:00,120,laguna beach (south),ca,us,33.542222,-117.782222,light,USO(s) - Unidentified Submerged Object(s) -- y...,1963-06-01,20:00:00
46943,1963-06-01 21:00:00,300,rodeo,ca,us,38.033056,-122.265833,sphere,Blue humming and vibrating sphere in my room,1963-06-01,21:00:00


In [13]:
def sametime(date):
    x = data.loc[data.date == date]
    n = x["country"].nunique()
    return n > 2

In [14]:
valid_dates = []
for d in dates:
    if sametime(d):
        valid_dates.append(d)


In [15]:
valid_dates 

[datetime.date(1978, 6, 1),
 datetime.date(2000, 6, 15),
 datetime.date(2003, 11, 22),
 datetime.date(2007, 4, 29),
 datetime.date(2013, 10, 3)]
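
The date selection above (count reports per date, then check the number of distinct countries) could also be expressed as one grouped aggregation. The sketch below is only an illustration under assumptions: `select_dates` and the toy frame are hypothetical and not part of the notebook, and the thresholds are lowered so the tiny example has something to select.

```python
import pandas as pd

def select_dates(df, min_reports=10, min_countries=2):
    """Return dates with more than `min_reports` rows and more than
    `min_countries` distinct countries (hypothetical helper)."""
    stats = df.groupby("date").agg(
        reports=("country", "size"),        # number of reports on that date
        countries=("country", "nunique"),   # number of distinct countries
    )
    mask = (stats["reports"] > min_reports) & (stats["countries"] > min_countries)
    return stats.index[mask].tolist()

# Toy data standing in for the real table (assumption, not the notebook's data).
toy = pd.DataFrame({
    "date": ["d1"] * 4 + ["d2"] * 4,
    "country": ["us", "ca", "gb", "us", "us", "us", "us", "ca"],
})

# d1 has 4 reports from 3 countries; d2 has 4 reports from 2 countries.
selected = select_dates(toy, min_reports=3, min_countries=2)
print(selected)
```

With the notebook's data, calling `select_dates(data)` with the default thresholds should correspond to the same criteria as the loop over `dates`, assuming the `date` column has already been created.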

In [16]:
train = data.loc[data["date"].isin(valid_dates[:4])].sort_values(by="time")

In [17]:
validate = data.loc[data.date == valid_dates[4]].sort_values(by="time")

In [18]:
train['shape'].unique()

array(['disk', 'light', 'circle', 'triangle', 'cigar', 'unknown',
       'sphere', 'other', 'oval', 'egg', 'cylinder', 'fireball',
       'formation', 'rectangle', 'changing'], dtype=object)

In [19]:
validate['shape'].unique()

array(['triangle', 'light', 'flash', 'circle', 'rectangle', 'other',
       'fireball', 'oval'], dtype=object)

In [20]:
def shaper(shape):
    """Map each observed shape to the numeric label of its shape group."""
    if shape in ('disk', 'circle', 'cigar','sphere',  'oval', 'egg', 'cylinder', 'fireball'):
        return 0
    if shape in ('triangle', 'rectangle'):
        return 1
    if shape in ('light', 'flash'):
        return 2
    else:
        return 3         
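
The same grouping can be written as a lookup table, which keeps the shape groups in one place and makes the fallback label explicit. This is a minimal sketch, not the notebook's code: `SHAPE_GROUPS`, `SHAPE_TO_LABEL`, and `shape_to_label` are hypothetical names, while the groups themselves are copied from `shaper`.

```python
# Shape groups copied from the shaper function; any other shape falls back to 3.
SHAPE_GROUPS = {
    0: ("disk", "circle", "cigar", "sphere", "oval", "egg", "cylinder", "fireball"),
    1: ("triangle", "rectangle"),
    2: ("light", "flash"),
}

# Invert the groups into a flat shape -> label dictionary.
SHAPE_TO_LABEL = {
    shape: label for label, shapes in SHAPE_GROUPS.items() for shape in shapes
}

def shape_to_label(shape, default=3):
    """Return the numeric label for a shape, or `default` for ungrouped shapes."""
    return SHAPE_TO_LABEL.get(shape, default)

print(shape_to_label("disk"), shape_to_label("flash"), shape_to_label("changing"))
```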

In [21]:
train["label"] = train["shape"].map(shaper)

In [22]:
validate["label"] = validate["shape"].map(shaper)

In [23]:
validate["hour"] = validate["time"].map(lambda x: x.hour)

In [24]:
train["hour"] = train["time"].map(lambda x: x.hour)

In [25]:
validate.head(3)

Unnamed: 0,timestamp,duration (seconds),city,state,country,latitude,longitude,shape,comments,date,time,label,hour
6201,2013-10-03 00:10:00,300,athens,ga,us,33.960833,-83.378056,triangle,Triangle-shaped object the size of a go-kart w...,2013-10-03,00:10:00,1,0
6202,2013-10-03 01:30:00,25,baltimore (city line),md,us,39.290278,-76.6125,light,On the evening of September 3rd&#44 I was smok...,2013-10-03,01:30:00,2,1
6203,2013-10-03 02:00:00,25,waterbury,vt,us,44.337778,-72.756667,light,Two bright glowing objects that were parallel ...,2013-10-03,02:00:00,2,2


In [26]:
train.head(3)

Unnamed: 0,timestamp,duration (seconds),city,state,country,latitude,longitude,shape,comments,date,time,label,hour
47220,1978-06-01 00:00:00,7200,grants pass,or,us,42.439167,-123.327222,disk,Object landed on corner of street. Stayed the...,1978-06-01,00:00:00,0,0
12297,2003-11-22 00:30:00,10,harding (near),wv,us,38.948056,-79.959444,disk,Transluscent disc shaped ufo flying right at u...,2003-11-22,00:30:00,0,0
47221,1978-06-01 00:30:00,15,grand saline,tx,us,32.673333,-95.709167,light,Blinding white light over my pickup then sudde...,1978-06-01,00:30:00,2,0


In [27]:
train_features = train[["latitude", "longitude", "hour"]].values.tolist()

In [28]:
train_labels = train["label"].tolist()

In [29]:
validate_features = validate[["latitude", "longitude", "hour"]].values.tolist()

In [30]:
validate_labels = validate["label"].tolist()

In [31]:
from sklearn.ensemble import RandomForestClassifier

In [32]:
clf = RandomForestClassifier()

In [33]:
clf.fit(train_features, train_labels)

In [34]:
clf.predict([train_features[0]])

array([0])

In [35]:
clf.score(validate_features, validate_labels)

0.3157894736842105

This shows that latitude, longitude, and time cannot accurately predict the observed UFO shape on a new day: the accuracy of about 32% is only slightly above the 25% expected from uniform random guessing over the four label classes.
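
To put the score in context, it helps to compare it with simple baselines. The sketch below computes two of them using only the standard library: uniform random guessing and always predicting the most frequent training label. The function name `baseline_accuracies` and the toy label lists are assumptions for illustration; they are not the notebook's data.

```python
from collections import Counter

def baseline_accuracies(train_labels, validate_labels):
    """Return (uniform-guess accuracy, majority-class accuracy)."""
    # Expected accuracy when guessing uniformly among all observed classes.
    n_classes = len(set(train_labels) | set(validate_labels))
    uniform = 1 / n_classes
    # Accuracy when always predicting the most common training label.
    majority_label = Counter(train_labels).most_common(1)[0][0]
    hits = sum(1 for y in validate_labels if y == majority_label)
    majority = hits / len(validate_labels)
    return uniform, majority

# Toy labels (hypothetical): four classes, label 0 is the training majority.
uniform, majority = baseline_accuracies([0, 0, 0, 1, 2, 3], [0, 1, 0, 2])
print(uniform, majority)
```

Passing the notebook's `train_labels` and `validate_labels` to this function would show whether the classifier beats the majority-class baseline, which can be a stricter bar than uniform guessing when the classes are imbalanced.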

In [36]:
clf.score(train_features, train_labels)

1.0

As you can see, the accuracy on the training data is 100%: the model is overfitting the training set, and its prediction rules do not generalize to unseen data.
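
A single held-out date gives only one, fairly noisy, estimate of generalization. One possible check is leave-one-date-out cross-validation, where each date is held out in turn. The sketch below uses scikit-learn's `LeaveOneGroupOut` with `cross_val_score` on synthetic data; the random features, labels, and the five hypothetical date groups are assumptions and do not reproduce the notebook's results.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

rng = np.random.default_rng(0)
n = 100

# Synthetic stand-ins for the notebook's features (assumption).
X = np.column_stack([
    rng.uniform(25, 50, n),      # latitude
    rng.uniform(-125, -70, n),   # longitude
    rng.integers(0, 24, n),      # hour of day
])
y = rng.integers(0, 4, n)        # random labels in {0, 1, 2, 3}

# Five hypothetical dates with 20 sightings each.
groups = np.repeat(np.arange(5), n // 5)

clf = RandomForestClassifier(n_estimators=50, random_state=0)

# Each fold trains on four dates and scores on the held-out fifth.
scores = cross_val_score(clf, X, y, groups=groups, cv=LeaveOneGroupOut())
print(scores, scores.mean())
```

With the real data, `groups` would be the `date` column of the combined train and validation rows, so that every selected date serves once as the unseen day.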

## Final analysis
We can conclude that the relationship between the features and the shape labels on the first four dates does not carry over to the last date. This can be seen as evidence that there is no recognisable pattern between spacetime and reported shape.