# Data Visualization semester project
## Meteorite landings data pre-processing

#### Libraries:

In [69]:
import pandas as pd

Import the meteorite landings dataset from the NASA website.

In [99]:
URL: str = "https://data.nasa.gov/api/views/gh4g-9sfh/rows.csv?accessType=DOWNLOAD"
df: pd.DataFrame = pd.read_csv(URL)

### Pre-processing:

#### Brief exploration

In [100]:
print(df.shape)
df.dtypes

(45716, 10)


name            object
id               int64
nametype        object
recclass        object
mass (g)       float64
fall            object
year            object
reclat         float64
reclong        float64
GeoLocation     object
dtype: object

In [101]:
df.head()

Unnamed: 0,name,id,nametype,recclass,mass (g),fall,year,reclat,reclong,GeoLocation
0,Aachen,1,Valid,L5,21.0,Fell,01/01/1880 12:00:00 AM,50.775,6.08333,"(50.775, 6.08333)"
1,Aarhus,2,Valid,H6,720.0,Fell,01/01/1951 12:00:00 AM,56.18333,10.23333,"(56.18333, 10.23333)"
2,Abee,6,Valid,EH4,107000.0,Fell,01/01/1952 12:00:00 AM,54.21667,-113.0,"(54.21667, -113.0)"
3,Acapulco,10,Valid,Acapulcoite,1914.0,Fell,01/01/1976 12:00:00 AM,16.88333,-99.9,"(16.88333, -99.9)"
4,Achiras,370,Valid,L6,780.0,Fell,01/01/1902 12:00:00 AM,-33.16667,-64.95,"(-33.16667, -64.95)"


We have:
- 6 categorical variables of interest: id, name, nametype, recclass, fall and year.
    - If there are no duplicate names, id does not provide additional information.
    - year is stored in the wrong format: a full timestamp string instead of a year.
- 3 numerical variables of interest: mass, latitude and longitude.
    - mass (g) does not follow the naming convention of the rest of the dataset.
    - GeoLocation only repeats reclat and reclong, so it does not contain additional information.
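
The claim that GeoLocation adds nothing can be checked by parsing the string and comparing it with reclat and reclong. The following is a minimal sketch on a hand-made sample rather than the real dataset; the sample frame and the helper name `geolocation_is_redundant` are illustrative and not part of the notebook.

```python
import numpy as np
import pandas as pd

# Hand-made sample mimicking the raw columns; not the real dataset.
sample = pd.DataFrame({
    "reclat": [50.775, 56.18333, np.nan],
    "reclong": [6.08333, 10.23333, np.nan],
    "GeoLocation": ["(50.775, 6.08333)", "(56.18333, 10.23333)", np.nan],
})


def geolocation_is_redundant(frame: pd.DataFrame) -> bool:
    """Return True if GeoLocation only repeats reclat and reclong."""
    # Pull the two numbers out of "(lat, lon)"; missing strings stay NaN.
    parsed = (frame["GeoLocation"]
              .str.extract(r"\(\s*([^,]+),\s*([^)]+)\)")
              .astype(float))
    # equal_nan=True treats a missing coordinate pair as matching a missing string.
    same_lat = np.isclose(parsed[0].to_numpy(), frame["reclat"].to_numpy(), equal_nan=True)
    same_lon = np.isclose(parsed[1].to_numpy(), frame["reclong"].to_numpy(), equal_nan=True)
    return bool((same_lat & same_lon).all())


redundant = geolocation_is_redundant(sample)
```

Running the same helper on the full `df` before dropping the column would confirm or refute the assumption for every row.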

In [102]:
df['name'].duplicated().any()

False

Since the name uniquely identifies the meteorite, we can drop the id together with GeoLocation.

In [74]:
df.drop(['id', 'GeoLocation'], axis='columns', inplace=True)

We rename mass (g) to mass and give more meaningful names to the other columns:
- nametype, which represents whether the meteorite landed as a meteorite or is now a relic after landing, becomes status.
- fall, which represents how the meteorite was detected (while falling or found afterwards), becomes detection method.
- year becomes year fell or found.
- reclat and reclong become latitude and longitude.
- recclass becomes recommended classification, a standard on how to categorize meteorites.

In [104]:
df.rename(columns={'mass (g)': 'mass',
                   'nametype': 'status',
                   'fall': 'detection method',
                   'year': 'year fell or found',
                   'reclat': 'latitude',
                   'reclong': 'longitude',
                   'recclass': 'recommended classification'}, inplace=True)
df.head()

Unnamed: 0,name,id,status,recommended classification,mass,detection method,year fell or found,latitude,longitude,GeoLocation
0,Aachen,1,Valid,L5,21.0,Fell,01/01/1880 12:00:00 AM,50.775,6.08333,"(50.775, 6.08333)"
1,Aarhus,2,Valid,H6,720.0,Fell,01/01/1951 12:00:00 AM,56.18333,10.23333,"(56.18333, 10.23333)"
2,Abee,6,Valid,EH4,107000.0,Fell,01/01/1952 12:00:00 AM,54.21667,-113.0,"(54.21667, -113.0)"
3,Acapulco,10,Valid,Acapulcoite,1914.0,Fell,01/01/1976 12:00:00 AM,16.88333,-99.9,"(16.88333, -99.9)"
4,Achiras,370,Valid,L6,780.0,Fell,01/01/1902 12:00:00 AM,-33.16667,-64.95,"(-33.16667, -64.95)"


Fixing the years

In [105]:
# Keep only the four-digit year from strings such as "01/01/1880 12:00:00 AM".
df["year fell or found"] = df["year fell or found"].str[6:10]

df["year fell or found"].head()

0    1880
1    1951
2    1952
3    1976
4    1902
Name: year fell or found, dtype: object
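
After slicing, the year is still text (dtype object). If a numeric year is needed later, one option is pandas' nullable integer dtype, which keeps missing years as missing instead of turning the column into floats. The sketch below uses a hand-made series, not the real column. `pd.to_datetime` is a less obvious fit here: nanosecond-resolution pandas timestamps only cover roughly the years 1677 to 2262, so any landing year outside that window would not parse.

```python
import pandas as pd

# Hand-made sample of the raw timestamp strings, including a missing value.
raw_year = pd.Series(["01/01/1880 12:00:00 AM", "01/01/1951 12:00:00 AM", None])

# Characters 6 to 9 hold the four-digit year in "MM/DD/YYYY hh:mm:ss AM".
year_text = raw_year.str[6:10]

# errors="coerce" turns unparsable entries into missing values;
# "Int64" (capital I) is the nullable integer dtype.
year_number = pd.to_numeric(year_text, errors="coerce").astype("Int64")
```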

Exploring the missing values:

In [60]:
df.isnull().sum()

name                             0
status                           0
recommended classification       0
mass                           131
detection method                 0
year fell or found             291
latitude                      7315
longitude                     7315
dtype: int64

Our visualization requires geolocation, hence we drop the meteorites with
missing latitude and longitude coordinates.

In [61]:
df = df[df['latitude'].notna()]
df.isnull().sum()

name                            0
status                          0
recommended classification      0
mass                          119
detection method                0
year fell or found            178
latitude                        0
longitude                       0
dtype: int64

In [62]:
df.shape

(38401, 8)
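
The filter above checks only latitude, which is enough here because latitude and longitude show the same number of missing values (7315 each). A more explicit variant names both columns in `dropna`. The sketch below runs on a hand-made frame to show that rows missing only the mass are kept.

```python
import numpy as np
import pandas as pd

# Hand-made sample: B lacks a mass, C lacks both coordinates, D lacks only longitude.
sample = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "mass": [21.0, np.nan, 780.0, 5.0],
    "latitude": [50.775, 56.18333, np.nan, -33.16667],
    "longitude": [6.08333, 10.23333, np.nan, np.nan],
})

# Drop a row if either coordinate is missing; other missing values are tolerated.
located = sample.dropna(subset=["latitude", "longitude"])
```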

Year and mass are less important for our visualization, so we keep the rows where they are missing.

Saving the dataset:

In [63]:
df.to_csv("meteorite_landings_preprocessed.csv", index=False)

## Below we show our aborted attempt to geo-locate the landing sites by country.
#### Assigning an address to each meteorite landing found by the geo-locator with its coordinates.

##### Connection done using code from GeoPy's GitHub repository:
https://github.com/geopy/geopy/blob/master/geopy/geocoders/mapbox.py

In [64]:
from functools import partial
from urllib.parse import quote, urlencode

from geopy.geocoders.base import DEFAULT_SENTINEL, Geocoder
from geopy.location import Location
from geopy.util import logger

__all__ = ("MapBox", )


class MapBox(Geocoder):
    """Geocoder using the Mapbox API.
    Documentation at:
        https://www.mapbox.com/api-documentation/
    """

    api_path = '/geocoding/v5/mapbox.places/%(query)s.json/'

    def __init__(
            self,
            api_key,
            *,
            scheme=None,
            timeout=DEFAULT_SENTINEL,
            proxies=DEFAULT_SENTINEL,
            user_agent=None,
            ssl_context=DEFAULT_SENTINEL,
            adapter_factory=None,
            domain='api.mapbox.com'
    ):
        """
        :param str api_key: The API key required by Mapbox to perform
            geocoding requests. API keys are managed through Mapbox's account
            page (https://www.mapbox.com/account/access-tokens).
        :param str scheme:
            See :attr:`geopy.geocoders.options.default_scheme`.
        :param int timeout:
            See :attr:`geopy.geocoders.options.default_timeout`.
        :param dict proxies:
            See :attr:`geopy.geocoders.options.default_proxies`.
        :param str user_agent:
            See :attr:`geopy.geocoders.options.default_user_agent`.
        :type ssl_context: :class:`ssl.SSLContext`
        :param ssl_context:
            See :attr:`geopy.geocoders.options.default_ssl_context`.
        :param callable adapter_factory:
            See :attr:`geopy.geocoders.options.default_adapter_factory`.
            .. versionadded:: 2.0
        :param str domain: base api domain for mapbox
        """
        super().__init__(
            scheme=scheme,
            timeout=timeout,
            proxies=proxies,
            user_agent=user_agent,
            ssl_context=ssl_context,
            adapter_factory=adapter_factory,
        )
        self.api_key = api_key
        self.domain = domain.strip('/')
        self.api = "%s://%s%s" % (self.scheme, self.domain, self.api_path)

    def _parse_json(self, json, exactly_one=True):
        """Returns location, (latitude, longitude) from json feed."""
        features = json['features']
        if not features:
            return None

        def parse_feature(feature):
            location = feature['place_name']
            longitude = feature['geometry']['coordinates'][0]
            latitude = feature['geometry']['coordinates'][1]
            return Location(location, (latitude, longitude), feature)
        if exactly_one:
            return parse_feature(features[0])
        else:
            return [parse_feature(feature) for feature in features]


    def reverse(
            self,
            query,
            *,
            exactly_one=True,
            timeout=DEFAULT_SENTINEL
    ):
        """
        Return an address by location point.
        :param query: The coordinates for which you wish to obtain the
            closest human-readable addresses.
        :type query: :class:`geopy.point.Point`, list or tuple of ``(latitude,
            longitude)``, or string as ``"%(latitude)s, %(longitude)s"``.
        :param bool exactly_one: Return one result or a list of results, if
            available.
        :param int timeout: Time, in seconds, to wait for the geocoding service
            to respond before raising a :class:`geopy.exc.GeocoderTimedOut`
            exception. Set this only if you wish to override, on this call
            only, the value set during the geocoder's initialization.
        :rtype: ``None``, :class:`geopy.location.Location` or a list of them, if
            ``exactly_one=False``.
        """
        params = {'access_token': self.api_key}

        point = self._coerce_point_to_string(query, "%(lon)s,%(lat)s")
        quoted_query = quote(point.encode('utf-8'))
        url = "?".join((self.api % dict(query=quoted_query),
                        urlencode(params)))
        logger.debug("%s.reverse: %s", self.__class__.__name__, url)
        callback = partial(self._parse_json, exactly_one=exactly_one)
        return self._call_geocoder(url, callback, timeout=timeout)

In [65]:
import time
from typing import List, Tuple


mapbox: MapBox = MapBox('pk.eyJ1IjoidXR0aSIsImEiOiJja3llNXJzcHcwNGhnMm5wbWszYXc3cDVvIn0.jmunv4h49Qe6bqSvRhqUdw')
points: pd.DataFrame = df[['latitude','longitude']]
point_list: List[Tuple] = [tuple(x) for x in points.to_numpy()]

In [66]:
country = list()

def reverse_geo_locate():
    for point in point_list:
        # pause between requests to stay under the API rate limit
        time.sleep(0.005)
        location = mapbox.reverse(point)
        # store a placeholder when the geocoder returns no result
        country.append(location if location else ("no address", "no country"))
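
The pattern used by this loop (one call per point, a short pause between calls, a placeholder when nothing comes back) can be tried without network access by swapping the geocoder for a stub. In the sketch below, `reverse_all` and `fake_reverse` are hypothetical names introduced for illustration; they are not part of GeoPy or of this notebook.

```python
import time
from typing import Callable, List, Optional, Tuple

Point = Tuple[float, float]
NO_RESULT = ("no address", "no country")


def reverse_all(points: List[Point],
                reverse: Callable[[Point], Optional[object]],
                delay: float = 0.005) -> List[object]:
    """Call `reverse` on each point, sleeping `delay` seconds before each call."""
    results = []
    for point in points:
        time.sleep(delay)  # crude throttle between requests
        location = reverse(point)
        results.append(location if location else NO_RESULT)
    return results


def fake_reverse(point: Point) -> Optional[str]:
    """Stub geocoder (assumption): pretends nothing is found south of the equator."""
    lat, lon = point
    return f"somewhere near {lat}, {lon}" if lat >= 0 else None


results = reverse_all([(50.775, 6.08333), (-33.16667, -64.95)], fake_reverse)
```

Returning the list instead of appending to a global makes the function easier to test and to re-run on a subset of points.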


In [67]:
new_df = pd.DataFrame(country)

In [68]:
new_df.to_csv('countries_last_attempt')

After checking the quality of the geo-located data, in particular the number of missing countries, we decided
to use another strategy to represent the population density.
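
The share of failed lookups can be measured directly from the collected results. The following is a minimal sketch, assuming failures were stored as the placeholder tuple used in the loop above; the result list is made up for illustration and does not reflect the actual geocoder output.

```python
import pandas as pd

NO_RESULT = ("no address", "no country")

# Made-up results; successful lookups are represented by plain strings here.
results = ["Aachen, Germany", NO_RESULT, "Aarhus, Denmark", NO_RESULT]

# True where the lookup failed; the mean of a boolean series is the failure share.
is_missing = pd.Series([r == NO_RESULT for r in results])
missing_share = is_missing.mean()
```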