# Data Visualization semester project
## Meteorite landings data pre-processing

#### Libraries:

In [1]:
import pandas as pd

Import the meteorite landings dataset from the NASA website.

In [143]:
URL: str = "https://data.nasa.gov/api/views/gh4g-9sfh/rows.csv?accessType=DOWNLOAD"
df: pd.DataFrame = pd.read_csv(URL)

### Pre-processing:

#### Brief exploration

In [144]:
print(df.shape)
df.dtypes

(45716, 10)


name            object
id               int64
nametype        object
recclass        object
mass (g)       float64
fall            object
year            object
reclat         float64
reclong        float64
GeoLocation     object
dtype: object

In [145]:
df.head()

Unnamed: 0,name,id,nametype,recclass,mass (g),fall,year,reclat,reclong,GeoLocation
0,Aachen,1,Valid,L5,21.0,Fell,01/01/1880 12:00:00 AM,50.775,6.08333,"(50.775, 6.08333)"
1,Aarhus,2,Valid,H6,720.0,Fell,01/01/1951 12:00:00 AM,56.18333,10.23333,"(56.18333, 10.23333)"
2,Abee,6,Valid,EH4,107000.0,Fell,01/01/1952 12:00:00 AM,54.21667,-113.0,"(54.21667, -113.0)"
3,Acapulco,10,Valid,Acapulcoite,1914.0,Fell,01/01/1976 12:00:00 AM,16.88333,-99.9,"(16.88333, -99.9)"
4,Achiras,370,Valid,L6,780.0,Fell,01/01/1902 12:00:00 AM,-33.16667,-64.95,"(-33.16667, -64.95)"


We have:
- 6 categorical variables of interest: id, name, nametype, recclass, fall and year.
    - If there are no duplicate names, id does not provide additional information.
    - year is stored in the wrong format: a full timestamp string rather than just the year.
- 3 numerical variables of interest: mass, longitude and latitude.
    - mass (g) does not follow the naming convention of the rest of the dataset.
    - GeoLocation does not contain additional information, since it repeats reclat and reclong.

In [146]:
df['name'].duplicated().any()

False

Since the name uniquely identifies each meteorite, we can drop id together with GeoLocation.

In [147]:
df.drop(['id', 'GeoLocation'], axis='columns', inplace=True)

We rename mass (g) to mass and give more meaningful names to two other columns:
nametype, which records whether the meteorite landed as a meteorite or is now a
relic after landing, becomes status, and fall, which records how the meteorite
was detected (while falling or found afterwards), becomes detection method.

We also rename reclat, reclong and recclass to latitude, longitude and recommended
classification; the latter is a standard for categorizing meteorites.

In [148]:
df.rename(columns={'mass (g)': 'mass',
                   'nametype': 'status',
                   'fall': 'detection method',
                   'reclat': 'latitude',
                   'reclong': 'longitude',
                   'recclass': 'recommended classification'}, inplace=True)
df.head()

Unnamed: 0,name,status,recommended classification,mass,detection method,year,latitude,longitude
0,Aachen,Valid,L5,21.0,Fell,01/01/1880 12:00:00 AM,50.775,6.08333
1,Aarhus,Valid,H6,720.0,Fell,01/01/1951 12:00:00 AM,56.18333,10.23333
2,Abee,Valid,EH4,107000.0,Fell,01/01/1952 12:00:00 AM,54.21667,-113.0
3,Acapulco,Valid,Acapulcoite,1914.0,Fell,01/01/1976 12:00:00 AM,16.88333,-99.9
4,Achiras,Valid,L6,780.0,Fell,01/01/1902 12:00:00 AM,-33.16667,-64.95


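Fixing the years: each value is a timestamp string such as 01/01/1880 12:00:00 AM, so we keep only the four-digit year.

In [149]:
# characters 6 to 9 of "MM/DD/YYYY hh:mm:ss AM" hold the four-digit year
df["year"] = df['year'].str[6:10]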
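
The year column holds timestamp strings in the form shown by `df.head()`. As a minimal sketch, assuming every non-missing value follows the `MM/DD/YYYY hh:mm:ss AM` layout, the year could also be pulled out with a regular expression and stored as a nullable integer. The `raw_years` series below is a small hand-made sample, not the real column.

```python
import pandas as pd

# Two values copied from the df.head() output, plus one missing entry (assumed sample).
raw_years = pd.Series([
    "01/01/1880 12:00:00 AM",
    "01/01/1951 12:00:00 AM",
    None,
])

# Capture the four digits that follow "MM/DD/"; missing entries stay missing.
years = raw_years.str.extract(r"^\d{2}/\d{2}/(\d{4})", expand=False)

# Convert to a nullable integer column so missing years remain <NA>.
years = pd.to_numeric(years).astype("Int64")
print(years.tolist())
```

Matching on the pattern rather than on fixed positions makes the extraction fail visibly (with a missing value) if a string does not follow the expected layout.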

Exploring the missing values:

In [150]:
df.isnull().sum()

name                             0
status                           0
recommended classification       0
mass                           131
detection method                 0
year                           291
latitude                      7315
longitude                     7315
dtype: int64

Our visualization requires geolocation, hence we drop the meteorites with
missing latitude and longitude coordinates.

In [151]:
df = df[df['latitude'].notna()]
df.isnull().sum()

name                            0
status                          0
recommended classification      0
mass                          119
detection method                0
year                          178
latitude                        0
longitude                       0
dtype: int64
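
The same filter can be expressed with `dropna` and a column subset, which drops a row when either coordinate is missing. This is a sketch on a hypothetical three-row frame (`toy`), not on the real dataset.

```python
import pandas as pd

# Hypothetical frame: row "B" lacks a mass, row "C" lacks coordinates.
toy = pd.DataFrame({
    "name": ["A", "B", "C"],
    "mass": [21.0, None, 780.0],
    "latitude": [50.775, 56.18333, None],
    "longitude": [6.08333, 10.23333, None],
})

# Drop only the rows where latitude or longitude is missing;
# rows with a missing mass are kept.
located = toy.dropna(subset=["latitude", "longitude"])
print(located.isnull().sum())
```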

Year and mass are less important for our visualization, so we keep the rows where they are missing.

#### Assigning an address to each meteorite landing by reverse geocoding its coordinates

##### Connection done using code from GeoPy's GitHub repository:
https://github.com/geopy/geopy/blob/master/geopy/geocoders/mapbox.py

In [152]:
from functools import partial
from urllib.parse import quote, urlencode

from geopy.geocoders.base import DEFAULT_SENTINEL, Geocoder
from geopy.location import Location
from geopy.util import logger

__all__ = ("MapBox", )


class MapBox(Geocoder):
    """Geocoder using the Mapbox API.
    Documentation at:
        https://www.mapbox.com/api-documentation/
    """

    api_path = '/geocoding/v5/mapbox.places/%(query)s.json/'

    def __init__(
            self,
            api_key,
            *,
            scheme=None,
            timeout=DEFAULT_SENTINEL,
            proxies=DEFAULT_SENTINEL,
            user_agent=None,
            ssl_context=DEFAULT_SENTINEL,
            adapter_factory=None,
            domain='api.mapbox.com'
    ):
        """
        :param str api_key: The API key required by Mapbox to perform
            geocoding requests. API keys are managed through Mapbox's account
            page (https://www.mapbox.com/account/access-tokens).
        :param str scheme:
            See :attr:`geopy.geocoders.options.default_scheme`.
        :param int timeout:
            See :attr:`geopy.geocoders.options.default_timeout`.
        :param dict proxies:
            See :attr:`geopy.geocoders.options.default_proxies`.
        :param str user_agent:
            See :attr:`geopy.geocoders.options.default_user_agent`.
        :type ssl_context: :class:`ssl.SSLContext`
        :param ssl_context:
            See :attr:`geopy.geocoders.options.default_ssl_context`.
        :param callable adapter_factory:
            See :attr:`geopy.geocoders.options.default_adapter_factory`.
            .. versionadded:: 2.0
        :param str domain: base api domain for mapbox
        """
        super().__init__(
            scheme=scheme,
            timeout=timeout,
            proxies=proxies,
            user_agent=user_agent,
            ssl_context=ssl_context,
            adapter_factory=adapter_factory,
        )
        self.api_key = api_key
        self.domain = domain.strip('/')
        self.api = "%s://%s%s" % (self.scheme, self.domain, self.api_path)

    def _parse_json(self, json, exactly_one=True):
        """Returns location, (latitude, longitude) from json feed."""
        features = json['features']
        if not features:
            return None

        def parse_feature(feature):
            location = feature['place_name']
            longitude = feature['geometry']['coordinates'][0]
            latitude = feature['geometry']['coordinates'][1]
            return Location(location, (latitude, longitude), feature)
        if exactly_one:
            return parse_feature(features[0])
        else:
            return [parse_feature(feature) for feature in features]


    def reverse(
            self,
            query,
            *,
            exactly_one=True,
            timeout=DEFAULT_SENTINEL
    ):
        """
        Return an address by location point.
        :param query: The coordinates for which you wish to obtain the
            closest human-readable addresses.
        :type query: :class:`geopy.point.Point`, list or tuple of ``(latitude,
            longitude)``, or string as ``"%(latitude)s, %(longitude)s"``.
        :param bool exactly_one: Return one result or a list of results, if
            available.
        :param int timeout: Time, in seconds, to wait for the geocoding service
            to respond before raising a :class:`geopy.exc.GeocoderTimedOut`
            exception. Set this only if you wish to override, on this call
            only, the value set during the geocoder's initialization.
        :rtype: ``None``, :class:`geopy.location.Location` or a list of them, if
            ``exactly_one=False``.
        """
        params = {'access_token': self.api_key}

        point = self._coerce_point_to_string(query, "%(lon)s,%(lat)s")
        quoted_query = quote(point.encode('utf-8'))
        url = "?".join((self.api % dict(query=quoted_query),
                        urlencode(params)))
        logger.debug("%s.reverse: %s", self.__class__.__name__, url)
        callback = partial(self._parse_json, exactly_one=exactly_one)
        return self._call_geocoder(url, callback, timeout=timeout)

In [153]:
from typing import Tuple
from typing import List
import time

mapbox: MapBox = MapBox('pk.eyJ1IjoiY2hyaXN0aWFuLXBhbGEiLCJhIjoiY2t5YTZ4ZjMwMDJyaDJ3cWx5bDY4amp6ZyJ9.uFBztOoKbsmrpYTqol-NXA')

points: pd.DataFrame = df[['latitude', 'longitude']]
point_list: List[Tuple] = [tuple(x) for x in points.to_numpy()]

batch_size: int = 800

for i in range(0, len(point_list), batch_size):
    time.sleep(2)  # pause before each batch of requests
    batch = point_list[i:i+batch_size]
    # append every point of the current batch and its address to the output file
    with open('meteorite_landings_addresses.txt', 'a', encoding='utf-8') as f:
        for point in batch:
            print(point, mapbox.reverse(point), file=f)
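
The batching logic can be tried without an API key by separating it from the geocoding call. The sketch below is an illustration under assumptions: `batches`, `geocode_all` and `fake_reverse` are hypothetical helper names, and `fake_reverse` only stands in for `mapbox.reverse`.

```python
import time
from typing import Callable, Iterator, List, Tuple

Point = Tuple[float, float]


def batches(points: List[Point], size: int) -> Iterator[List[Point]]:
    """Yield consecutive slices of `points` with at most `size` items each."""
    for start in range(0, len(points), size):
        yield points[start:start + size]


def geocode_all(points: List[Point],
                reverse: Callable[[Point], object],
                size: int,
                pause: float = 0.0) -> List[Tuple[Point, object]]:
    """Call `reverse` once per point, sleeping `pause` seconds before each batch."""
    results = []
    for batch in batches(points, size):
        time.sleep(pause)
        for point in batch:
            results.append((point, reverse(point)))
    return results


def fake_reverse(point: Point) -> str:
    """Hypothetical stand-in for mapbox.reverse; it only formats the point."""
    return f"address near {point[0]}, {point[1]}"


# Three coordinates taken from the df.head() output.
sample = [(50.775, 6.08333), (56.18333, 10.23333), (54.21667, -113.0)]
resolved = geocode_all(sample, fake_reverse, size=2)
```

Passing the geocoder in as a function means the real `mapbox.reverse` can be swapped in later without touching the loop.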

### Processing the new dataset (after supervised)

## World population data pre-processing

### Import the dataset

In [111]:
wp: pd.DataFrame = pd.read_csv("world_population.csv")

### Brief exploration

In [112]:
print(wp.shape)
wp.dtypes

(266, 65)


Country Name       object
Country Code       object
Indicator Name     object
Indicator Code     object
1960              float64
                   ...   
2016              float64
2017              float64
2018              float64
2019              float64
2020              float64
Length: 65, dtype: object

In [113]:
wp.head()

Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,...,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020
0,Aruba,ABW,Population density (people per sq. km of land ...,EN.POP.DNST,,307.966667,312.411111,314.994444,316.827778,318.65,...,566.944444,569.805556,573.138889,576.533333,579.661111,582.583333,585.338889,588.033333,590.611111,593.144444
1,Africa Eastern and Southern,AFE,Population density (people per sq. km of land ...,EN.POP.DNST,,9.078367,9.312152,9.554904,9.806505,10.067014,...,35.979974,36.97425,37.995279,39.040385,40.107067,41.195114,42.303709,43.431367,44.576507,45.737916
2,Afghanistan,AFG,Population density (people per sq. km of land ...,EN.POP.DNST,,14.044981,14.323809,14.617529,14.926281,15.250311,...,46.1315,47.730567,49.428043,51.114793,52.712071,54.196961,55.59555,56.937049,58.269395,59.627395
3,Africa Western and Central,AFW,Population density (people per sq. km of land ...,EN.POP.DNST,,10.877837,11.10994,11.351399,11.601453,11.859717,...,39.829118,40.929916,42.056948,43.211639,44.394917,45.607045,46.847251,48.11408,49.405535,50.720167
4,Angola,AGO,Population density (people per sq. km of land ...,EN.POP.DNST,,4.436874,4.498676,4.555554,4.60014,4.628678,...,19.427817,20.139508,20.86772,21.61047,22.366552,23.135062,23.916555,24.713072,25.527632,26.362612


In [114]:
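# Note: .duplicated().count() counts every entry of the boolean mask (True and False),
# so both lines print the number of rows; .duplicated().sum() would count the duplicates.
print(f"Number of duplicated fields in Indicator Name: {wp['Indicator Name'].duplicated().count()}")
print(f"Number of duplicated fields in Indicator Code: {wp['Indicator Code'].duplicated().count()}")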

print(f"Number of observations in dataset: {len(wp)}")

Number of duplicated fields in Indicator Name: 266
Number of duplicated fields in Indicator Code: 266
Number of observations in dataset: 266
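
To check whether a column carries any information, the count of distinct values is often more direct than the duplicate count. The following sketch uses a hypothetical four-row column (`toy`) to contrast `count()`, `sum()` and `nunique()`.

```python
import pandas as pd

# Hypothetical column in which every row holds the same indicator code.
toy = pd.Series(["EN.POP.DNST"] * 4)

mask = toy.duplicated()     # True for every repeat after the first occurrence
n_rows = mask.count()       # counts every non-null entry of the mask
n_duplicates = mask.sum()   # counts only the True entries
n_distinct = toy.nunique()  # number of distinct values in the column
```

On this toy column `n_rows` is 4, `n_duplicates` is 3 and `n_distinct` is 1; a column with a single distinct value adds nothing to the analysis.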


We have:
- Country Name
- Country Code
- Indicator Name
- Indicator Code
- population density for every year between 1960 and 2020

Since we only want the data from 2020, we drop the columns for the other years. Indicator Name and Indicator Code are not useful for us, so we drop them too.

In [115]:
# drop the indicator columns and every year before 2020 (column positions 2 to 63)
wp.drop(columns=wp.columns[2:64], inplace=True)
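
An alternative that does not depend on column positions is to select the wanted columns by name. This is a sketch on a hypothetical two-row frame (`toy`) whose values are copied from the `wp.head()` output, with most year columns left out.

```python
import pandas as pd

# Hypothetical reduced frame with the same column names as the dataset.
toy = pd.DataFrame({
    "Country Name": ["Aruba", "Angola"],
    "Country Code": ["ABW", "AGO"],
    "Indicator Name": ["Population density", "Population density"],
    "Indicator Code": ["EN.POP.DNST", "EN.POP.DNST"],
    "2019": [590.611111, 25.527632],
    "2020": [593.144444, 26.362612],
})

# Keep only the columns we need; .copy() gives an independent frame.
kept = toy[["Country Name", "Country Code", "2020"]].copy()
```

Selecting by name keeps working if the source file gains or loses year columns.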

In [116]:
wp.head()

Unnamed: 0,Country Name,Country Code,2020
0,Aruba,ABW,593.144444
1,Africa Eastern and Southern,AFE,45.737916
2,Afghanistan,AFG,59.627395
3,Africa Western and Central,AFW,50.720167
4,Angola,AGO,26.362612


### Exploring missing values

In [117]:
wp.isnull().sum()

Country Name    0
Country Code    0
2020            2
dtype: int64

In [118]:
null_data = wp[wp.isnull().any(axis=1)]
null_data

Unnamed: 0,Country Name,Country Code,2020
110,Not classified,INX,
261,Kosovo,XKX,


In [119]:
# we drop "Not classified"

wp = wp[wp["Country Name"] != "Not classified"]

#### And for Kosovo we take the last available data

In [120]:
kv: pd.DataFrame = pd.read_csv("world_population.csv")
    
kv = kv[kv["Country Name"] == "Kosovo"]

kv

Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,...,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020
261,Kosovo,XKX,Population density (people per sq. km of land ...,EN.POP.DNST,,88.729678,91.301552,93.873427,96.445302,99.017176,...,164.508129,165.987508,166.998898,166.507853,164.250574,163.273353,164.508404,,,


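#### The last available value is from 2017, so we use it

In [128]:
# row position 260 is Kosovo once "Not classified" is dropped; column 61 of kv is the year 2017
wp.iloc[260,2] = kv.iloc[0,61]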
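
Taking the last available value can also be written without hard-coded positions by forward-filling along the year columns. The sketch below runs on a hypothetical two-row frame (`toy`) with values copied from the outputs above; it is one possible approach, not the method used in this notebook.

```python
import pandas as pd

# Hypothetical excerpt: Kosovo has no values after 2017, Angola is complete.
toy = pd.DataFrame(
    {
        "2016": [163.273353, 23.135062],
        "2017": [164.508404, 23.916555],
        "2018": [None, 24.713072],
        "2019": [None, 25.527632],
        "2020": [None, 26.362612],
    },
    index=["Kosovo", "Angola"],
)

# Forward-fill across columns so each gap takes the most recent earlier value,
# then read the 2020 column.
filled_2020 = toy.ffill(axis=1)["2020"]
```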

In [130]:
wp.isnull().sum()

Country Name    0
Country Code    0
2020            0
dtype: int64

### Now we have no null values, and the file is ready to use

In [135]:
wp.to_csv("world_pop.csv", sep=',', encoding='utf-8')
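
By default `to_csv` also writes the row index as an unnamed first column, which shows up as `Unnamed: 0` when the file is read back. The sketch below, on a hypothetical one-row frame (`toy`), compares the header line with and without the index.

```python
import io

import pandas as pd

# Hypothetical one-row frame with a value copied from the wp.head() output.
toy = pd.DataFrame({"Country Name": ["Aruba"], "2020": [593.144444]})

# Write to in-memory buffers instead of files so nothing is left on disk.
with_index = io.StringIO()
toy.to_csv(with_index)

without_index = io.StringIO()
toy.to_csv(without_index, index=False)

header_with = with_index.getvalue().splitlines()[0]
header_without = without_index.getvalue().splitlines()[0]
```

Passing `index=False` is an option when the index carries no meaning for later steps.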

In [None]:
wp: pd.DataFrame = pd.read_csv("world_population.csv")
    
# drop the indicator columns and the unused years (column positions 2 to 63)
wp.drop(columns=wp.columns[2:64], inplace=True)


# we drop "Not classified"

wp = wp[wp["Country Name"] != "Not classified"]