# Regression modeling in practice

## 1 - Writing about your data

> **Step 1**: Describe your sample. Provide enough detail so that your reader can clearly understand the population that the study sample came from. Use meaningful labels. Do not use abbreviations (“PPM100”) or variable names.

> a) Describe the study population (who or what was studied).

> b) Report the level of analysis studied (individual, group, or aggregate).

> c) Report the number of observations in the data set.

> d) Describe your data analytic sample (the sample you are using for your analyses).

This dataset contains 384 345 Mars impact craters (the accompanying report states 384 343), each with a diameter of at least one kilometer. Each record describes one crater, so the level of analysis is the individual crater. About 79.2 % of the craters have a diameter smaller than 3 km; the remaining 20.8 %, with a diameter of 3 km or more, were studied in more detail.

In [1]:
import numpy as np
import pandas as pd

from bokeh.io import output_notebook
from bokeh.plotting import figure, show

# Render Bokeh plots inline in the notebook.
output_notebook()

In [2]:
DATASET_PATH = "RobbinsCraterDatabase_20121016.tab/RobbinsCraters_20121016.tab"

In [20]:
data = pd.read_table(DATASET_PATH, delimiter="\t", encoding='iso-8859-1', index_col="CRATER_ID")


How many records are there?

In [21]:
len(data)

384345
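When the number of loaded rows differs from a published total, a few quick checks can narrow down the cause. The sketch below is only an illustration on a made-up table: `count_report` is a hypothetical helper, and the toy IDs, values, and `published_total` are assumptions, not figures from the crater database.

```python
import pandas as pd


def count_report(df: pd.DataFrame, published_total: int) -> dict:
    """Summarize how the loaded row count compares with a published total."""
    return {
        "rows": len(df),
        "difference": len(df) - published_total,
        # Index labels that repeat an earlier label (first occurrence not counted).
        "duplicate_ids": int(df.index.duplicated().sum()),
        # Rows in which every column is empty.
        "empty_rows": int(df.isna().all(axis=1).sum()),
    }


# Toy stand-in for the crater table; the IDs and diameters are made up.
toy = pd.DataFrame(
    {"DIAM_CIRCLE_IMAGE": [1.2, 3.4, 3.4, None]},
    index=pd.Index(["a-1", "a-2", "a-2", "a-3"], name="CRATER_ID"),
)
report = count_report(toy, published_total=3)
print(report)
```

Running the same helper on `data` would show whether duplicated `CRATER_ID` values or fully empty rows account for the gap; it cannot by itself explain a difference caused by a later revision of the file.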

Let's explore **the first rows that have a non-empty crater name**, dropping the empty columns.

In [22]:
data.dropna(subset=["CRATER_NAME"]).head().dropna(axis=1, how='all')

| CRATER_ID | LATITUDE_CIRCLE_IMAGE | LONGITUDE_CIRCLE_IMAGE | LATITUDE_ELLIPSE_IMAGE | LONGITUDE_ELLIPSE_IMAGE | DIAM_CIRCLE_IMAGE | DIAM_CIRCLE_SD_IMAGE | DIAM_ELLIPSE_MAJOR_IMAGE | DIAM_ELLIPSE_MINOR_IMAGE | DIAM_ELLIPSE_ECCEN_IMAGE | DIAM_ELLIPSE_ELLIP_IMAGE | ... | MORPHOLOGY_EJECTA_1 | MORPHOLOGY_EJECTA_2 | DEGRADATION_STATE | CONFIDENCE_IMPACT_CRATER | LAYER_1_PERIMETER | LAYER_1_AREA | LAYER_1_LOBATENESS | LAYER_1_EJECTARAD_EQUIV | LAYER_1_EJECTARAD_REL | CRATER_NAME |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 01-000001 | 72.76 | 164.464 | 72.784 | 164.464 | 82.02 | 0.09 | 84.17 | 79.91 | 0.31 | 1.05 | ... | Rd/MLERS | HuBL | 3.0 | 4 | | | | | | Korolev |
| 01-000012 | 77.17 | -145.681 | 77.165 | -145.681 | 51.08 | 0.05 | 51.77 | 50.43 | 0.23 | 1.03 | ... | Rd | | 3.0 | 4 | | | | | | Dokka |
| 01-000022 | 81.925 | 76.714 | 81.984 | 76.711 | 43.57 | | 43.81 | 42.91 | 0.2 | 1.02 | ... | | | 1.0 | 4 | | | | | | Udzha |
| 01-000028 | 70.173 | 103.226 | 70.169 | 103.226 | 36.28 | | 36.74 | 35.83 | 0.22 | 1.03 | ... | SLERS | HuBL | 3.0 | 4 | | | | | | Louth |
| 01-000068 | 76.887 | -54.969 | 76.889 | -54.969 | 22.11 | 0.06 | 22.92 | 21.34 | 0.36 | 1.07 | ... | SLEPd | HuBL | 3.0 | 4 | 377.56 | 2641.22 | 1.94 | 19.98 | 1.81 | Escorial |
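The order of the chained calls matters here: because `dropna(axis=1, how='all')` runs after `head()`, a column is dropped when it is empty in the previewed rows, even if it holds values elsewhere in the dataset. A minimal sketch on a toy frame shows the effect; the middle row and the `ALWAYS_EMPTY` column are made up for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "DIAM_CIRCLE_IMAGE": [82.02, 1.5, 22.11],
        "ALWAYS_EMPTY": [np.nan, np.nan, np.nan],    # hypothetical column, empty in every row
        "LAYER_1_AREA": [np.nan, np.nan, 2641.22],   # empty in some rows only
        "CRATER_NAME": ["Korolev", np.nan, "Escorial"],
    }
)

# Keep only the rows that have a crater name.
named = toy.dropna(subset=["CRATER_NAME"])

# Take the first rows, then drop the columns that are empty in that preview.
preview = named.head().dropna(axis=1, how="all")

print(len(named))
print(list(preview.columns))
```

`LAYER_1_AREA` survives because one previewed row has a value, while `ALWAYS_EMPTY` is removed. To drop only the columns that are empty across all named craters, the column-wise `dropna` would have to come before `head()`.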


How elliptical are the craters?

In [23]:
data.DIAM_ELLIPSE_ECCEN_IMAGE.describe()

count    384336.000000
mean          0.424874
std           0.134485
min           0.020000
25%           0.330000
50%           0.420000
75%           0.510000
max           0.980000
Name: DIAM_ELLIPSE_ECCEN_IMAGE, dtype: float64

In [28]:
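# Eccentricity values, with missing entries dropped.
eccentricities = data['DIAM_ELLIPSE_ECCEN_IMAGE'].dropna()
# Normalized histogram (density=True) using NumPy's default of 10 bins.
hist, edges = np.histogram(eccentricities, density=True)

p = figure(width=400, height=400)
p.quad(top=hist, bottom=0, left=edges[:-1], right=edges[1:])
show(p)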
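To relate the eccentricity column to the ellipse diameters, the sketch below applies the standard definitions, eccentricity $e = \sqrt{1 - (b/a)^2}$ and ellipticity $a/b$ for major axis $a$ and minor axis $b$, to the five named craters previewed earlier. That the database computes its `DIAM_ELLIPSE_ECCEN_IMAGE` and `DIAM_ELLIPSE_ELLIP_IMAGE` columns this way is an assumption; it is consistent with those five rows after rounding to two decimals. The function names are illustrative.

```python
import numpy as np


def eccentricity(major, minor):
    """Eccentricity of an ellipse from its major and minor axes."""
    major = np.asarray(major, dtype=float)
    minor = np.asarray(minor, dtype=float)
    return np.sqrt(1.0 - (minor / major) ** 2)


def ellipticity(major, minor):
    """Ratio of the major axis to the minor axis."""
    return np.asarray(major, dtype=float) / np.asarray(minor, dtype=float)


# Major and minor ellipse diameters (km) of the five named craters in the preview.
major_km = [84.17, 51.77, 43.81, 36.74, 22.92]
minor_km = [79.91, 50.43, 42.91, 35.83, 21.34]

ecc = np.round(eccentricity(major_km, minor_km), 2)
ell = np.round(ellipticity(major_km, minor_km), 2)
print(ecc)
print(ell)
```

Under this definition, an eccentricity of 0.42 (the median reported by `describe()`) corresponds to a minor axis about 91 % as long as the major axis, so moderately large eccentricity values still describe craters that look nearly circular.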

What fraction of the craters has a diameter below 3 km, and what fraction has a diameter of 3 km or more?

In [35]:
len(data[data['DIAM_CIRCLE_IMAGE'] < 3]) / len(data)

0.7922309383496599

In [36]:
len(data[data['DIAM_CIRCLE_IMAGE'] >= 3]) / len(data)

0.20776906165034018
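The two fractions can also be computed in one step and expressed as percentages, which is the form needed for the sample description. This is a minimal sketch on made-up diameters: `diameter_split` is a hypothetical helper, not part of pandas, and the toy values are assumptions.

```python
import pandas as pd


def diameter_split(diameters_km: pd.Series, threshold_km: float = 3.0) -> pd.Series:
    """Percentage of craters below, and at or above, a diameter threshold."""
    shares = pd.Series(
        {
            # The mean of a boolean Series is the fraction of True values.
            f"< {threshold_km:g} km": (diameters_km < threshold_km).mean(),
            f">= {threshold_km:g} km": (diameters_km >= threshold_km).mean(),
        }
    )
    return (100 * shares).round(1)


# Toy diameters in km; these values are made up.
toy_diameters = pd.Series([1.0, 1.4, 2.2, 2.9, 3.0, 7.5, 12.0, 1.1, 1.8, 40.0])
split = diameter_split(toy_diameters)
print(split)
```

Calling `diameter_split(data['DIAM_CIRCLE_IMAGE'])` would apply the same split to the crater table. Missing diameters compare as `False` on both sides, so the two percentages only sum to 100 when the column has no missing values.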