# Regression modeling in practice

## 1 - Writing about your data

**Author**: Juan Luis Cano Rodríguez

### Sample

This dataset contains $N = 384\,345$ individual Mars impact craters (although the report states 384 383) with a diameter greater than or equal to one kilometer. About 79 % of the craters have a diameter smaller than 3 km ($N = 304\,490$), while the other 21 % have a diameter of 3 km or more and were studied in more detail ($N = 79\,855$).
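As a quick consistency check, the two subgroup counts quoted above should add up to the total, and their shares should match the rounded percentages. The variable names below are illustrative only.

```python
# Counts quoted in the Sample paragraph
n_small = 304_490  # craters with diameter < 3 km
n_large = 79_855   # craters with diameter >= 3 km

n_total = n_small + n_large
share_small = 100 * n_small / n_total
share_large = 100 * n_large / n_total

print(n_total)
print(f"{share_small:.2f} % small, {share_large:.2f} % large")
```

The sum agrees with the record count reported by `len(data)` in the supporting code.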

### Procedure

The craters were identified using a variety of instruments on board several NASA spacecraft, including Viking MDIM, Mars Reconnaissance Orbiter CTX, Mars Global Surveyor MOLA, and especially Mars Odyssey THEMIS, which supported "the bulk of crater identification and classification". The images were then post-processed using ArcGIS, a geographic information system, to precisely locate the craters on the Martian surface using manually selected points around their rims. Finally, all craters were fitted to circles and ellipses using a least-squares approach written for the software Igor Pro. The companion paper of the database was published in 2012.
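To illustrate what fitting a circle to rim points by least squares can look like, here is a minimal sketch in Python with NumPy. It is not the Igor Pro routine used for the database, whose details are not described here: it assumes planar coordinates and uses a simple algebraic (linearized) fit, and `fit_circle` and the synthetic rim points are hypothetical.

```python
import numpy as np


def fit_circle(x, y):
    """Algebraic least-squares circle fit; returns (xc, yc, radius).

    Rewrites (x - xc)^2 + (y - yc)^2 = r^2 as the linear system
    2*xc*x + 2*yc*y + c = x^2 + y^2, with c = r^2 - xc^2 - yc^2.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    A = np.column_stack([2 * x, 2 * y, np.ones_like(x)])
    b = x**2 + y**2
    (xc, yc, c), *_ = np.linalg.lstsq(A, b, rcond=None)
    return xc, yc, np.sqrt(c + xc**2 + yc**2)


# Hypothetical rim points sampled on a circle of radius 5 centered at (3, -1)
theta = np.linspace(0, 2 * np.pi, 12, endpoint=False)
rim_x = 3.0 + 5.0 * np.cos(theta)
rim_y = -1.0 + 5.0 * np.sin(theta)

xc, yc, r = fit_circle(rim_x, rim_y)
print(xc, yc, r)
```

With noisy, manually selected rim points the fit would no longer be exact, and a real pipeline would also have to account for the curvature of the planet's surface.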

### Measures

Mainly geometrical properties of the craters were measured, including parameters of the fitted circles and ellipses (radius, major and minor axes), location on the Martian surface (longitude and latitude), and elevation of the points that identify the crater rim. Notes about the morphology of each crater are also included in free-text form and, finally, the name of the crater is included when available.

## Supporting code

In [1]:
import numpy as np
import pandas as pd

from bokeh.io import output_notebook
from bokeh.plotting import figure, show

output_notebook()

In [2]:
DATASET_PATH = "RobbinsCraterDatabase_20121016.tab/RobbinsCraters_20121016.tab"

In [3]:
data = pd.read_table(DATASET_PATH, delimiter="\t", encoding='iso-8859-1', index_col="CRATER_ID")


How many records are there?

In [4]:
len(data)

384345

Let's explore **the first rows that have a non-empty crater name**, dropping the empty columns.

In [5]:
data.dropna(subset=["CRATER_NAME"]).head().dropna(axis=1, how='all')

| CRATER_ID | LATITUDE_CIRCLE_IMAGE | LONGITUDE_CIRCLE_IMAGE | LATITUDE_ELLIPSE_IMAGE | LONGITUDE_ELLIPSE_IMAGE | DIAM_CIRCLE_IMAGE | DIAM_CIRCLE_SD_IMAGE | DIAM_ELLIPSE_MAJOR_IMAGE | DIAM_ELLIPSE_MINOR_IMAGE | DIAM_ELLIPSE_ECCEN_IMAGE | DIAM_ELLIPSE_ELLIP_IMAGE | ... | MORPHOLOGY_EJECTA_1 | MORPHOLOGY_EJECTA_2 | DEGRADATION_STATE | CONFIDENCE_IMPACT_CRATER | LAYER_1_PERIMETER | LAYER_1_AREA | LAYER_1_LOBATENESS | LAYER_1_EJECTARAD_EQUIV | LAYER_1_EJECTARAD_REL | CRATER_NAME |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 01-000001 | 72.76 | 164.464 | 72.784 | 164.464 | 82.02 | 0.09 | 84.17 | 79.91 | 0.31 | 1.05 | ... | Rd/MLERS | HuBL | 3.0 | 4 | NaN | NaN | NaN | NaN | NaN | Korolev |
| 01-000012 | 77.17 | -145.681 | 77.165 | -145.681 | 51.08 | 0.05 | 51.77 | 50.43 | 0.23 | 1.03 | ... | Rd | NaN | 3.0 | 4 | NaN | NaN | NaN | NaN | NaN | Dokka |
| 01-000022 | 81.925 | 76.714 | 81.984 | 76.711 | 43.57 | NaN | 43.81 | 42.91 | 0.2 | 1.02 | ... | NaN | NaN | 1.0 | 4 | NaN | NaN | NaN | NaN | NaN | Udzha |
| 01-000028 | 70.173 | 103.226 | 70.169 | 103.226 | 36.28 | NaN | 36.74 | 35.83 | 0.22 | 1.03 | ... | SLERS | HuBL | 3.0 | 4 | NaN | NaN | NaN | NaN | NaN | Louth |
| 01-000068 | 76.887 | -54.969 | 76.889 | -54.969 | 22.11 | 0.06 | 22.92 | 21.34 | 0.36 | 1.07 | ... | SLEPd | HuBL | 3.0 | 4 | 377.56 | 2641.22 | 1.94 | 19.98 | 1.81 | Escorial |
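The chained `dropna` calls in the cell above do two different things: the first drops rows without a crater name, the second drops columns that are empty in the rows being displayed. The sketch below shows this on a tiny, made-up frame; the values and IDs are hypothetical and only the column names echo the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, not real database rows
toy = pd.DataFrame(
    {
        "DIAM_CIRCLE_IMAGE": [10.0, 1.5, 20.0, 2.2],
        "LAYER_1_AREA": [np.nan, np.nan, np.nan, 12.0],
        "CRATER_NAME": ["Alpha", np.nan, "Beta", np.nan],
    },
    index=pd.Index(["t-1", "t-2", "t-3", "t-4"], name="CRATER_ID"),
)

# Row filter: keep only craters that have a name
named = toy.dropna(subset=["CRATER_NAME"])

# Column filter: drop columns that are entirely NaN in the displayed rows
trimmed = named.head().dropna(axis=1, how="all")
print(trimmed)
```

Note that `LAYER_1_AREA` disappears here even though one unnamed crater has a value for it: the column filter only looks at the rows that survived the first step and `head()`.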


How elliptical are the craters?

In [6]:
data.DIAM_ELLIPSE_ECCEN_IMAGE.describe()

count    384336.000000
mean          0.424874
std           0.134485
min           0.020000
25%           0.330000
50%           0.420000
75%           0.510000
max           0.980000
Name: DIAM_ELLIPSE_ECCEN_IMAGE, dtype: float64
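For intuition about these values, the standard definitions of eccentricity, $e = \sqrt{1 - (b/a)^2}$, and ellipticity, $a/b$, can be evaluated from the major and minor axes. That the database uses exactly these definitions is an assumption, although they do reproduce the rounded values in the Korolev row shown earlier. `ellipse_shape` is a hypothetical helper.

```python
import math


def ellipse_shape(major, minor):
    """Return (eccentricity, ellipticity) from the two ellipse diameters."""
    eccentricity = math.sqrt(1 - (minor / major) ** 2)
    ellipticity = major / minor
    return eccentricity, ellipticity


# Major and minor diameters (km) from the Korolev row of the table
ecc, ellip = ellipse_shape(84.17, 79.91)
print(round(ecc, 2), round(ellip, 2))
```

An eccentricity of 0 corresponds to a perfect circle, so a median around 0.42 still describes fairly round craters: the minor axis is then roughly 91 % of the major axis.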

In [7]:
# Eccentricity values of the fitted ellipses, ignoring missing entries
eccentricities = data.DIAM_ELLIPSE_ECCEN_IMAGE.dropna()
hist, edges = np.histogram(eccentricities, density=True)

p = figure(width=400, height=400)
p.quad(top=hist, left=edges[:-1], right=edges[1:])
show(p)
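With `density=True`, `np.histogram` returns bar heights normalized so that the total bar area is 1, which is why the heights can exceed 1 for data confined to a narrow range. The short sketch below checks this on synthetic data; the uniform sample is an assumption made purely for illustration and is not the crater data.

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.uniform(0.02, 0.98, size=1_000)  # synthetic stand-in values

# Default is 10 equal-width bins, hence 11 edges
hist, edges = np.histogram(sample, density=True)

# Area of the bars: height times width, summed over all bins
area = np.sum(hist * np.diff(edges))
print(len(hist), len(edges), area)
```

This is also why the `quad` glyphs take `edges[:-1]` and `edges[1:]` as their left and right sides: there is always one more edge than there are bars.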

How many craters are below and above 3 km in diameter? What is the minimum diameter?

In [8]:
small = data[data.DIAM_CIRCLE_IMAGE < 3]  # craters strictly smaller than 3 km
print("{:.2f} % below 3 km (N = {})".format(len(small) / len(data) * 100, len(small)))

79.22 % below 3 km (N = 304490)


In [9]:
large = data[data.DIAM_CIRCLE_IMAGE >= 3]  # craters of 3 km or more
print("{:.2f} % above 3 km (N = {})".format(len(large) / len(data) * 100, len(large)))

20.78 % above 3 km (N = 79855)


In [10]:
data.DIAM_CIRCLE_IMAGE.min()

1.0
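The same shares can be computed without building filtered copies of the frame, since the mean of a boolean mask is the fraction of `True` values. This is a minimal sketch on a hypothetical five-value series; it assumes there are no missing diameters, because comparisons with `NaN` evaluate to `False`.

```python
import pandas as pd

# Hypothetical diameters in km, not taken from the database
diam = pd.Series([1.0, 1.2, 2.9, 3.0, 4.5], name="DIAM_CIRCLE_IMAGE")

below = diam < 3                # boolean mask
n_below = int(below.sum())      # True counts as 1
pct_below = 100 * below.mean()  # fraction of True values, as a percentage

print(f"{pct_below:.2f} % below 3 km (N = {n_below})")
print(f"{100 - pct_below:.2f} % at or above 3 km (N = {len(diam) - n_below})")
```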