# Data Mining and Visualisation -- Lab 1

Welcome to **JC3503: Data Mining and Visualisation!**

Throughout the labs, you will be asked to explore, visualise, and analyse datasets of your choosing. 

To get started, we will outline some useful packages in Python, and cover some of the Getting Started guides for them. If they include information that you aren't familiar with, don't worry! You can always come back to these guides as you progress through the course. If you've already covered any of these packages previously, feel free to skip these. Alternatively, you can use this notebook to test that your packages are all working properly.

You should work through these Getting Started guides to get familiar with some of the features of the packages. Once finished, explore Kaggle to find some datasets that you might like to use in future labs.

## Python Packages

There are various packages available for data mining and visualisation in Python. We will focus on a selection that are particularly popular and useful for conducting a broad range of analyses.

Packages we will use:
- [Pandas](#Pandas)
- [Seaborn](#Seaborn)
- [scikit-learn](#scikit-learn)

### Pandas

This is a short introduction to pandas, geared mainly for new users. You can see more complex recipes in the [Cookbook](https://pandas.pydata.org/docs/user_guide/cookbook.html#cookbook).

Customarily, we import as follows:

In [10]:
import numpy as np
import pandas as pd

#### Basic data structures in pandas

Pandas provides two types of classes for handling data:

1. [Series](https://pandas.pydata.org/docs/reference/api/pandas.Series.html#pandas.Series): a one-dimensional labeled array holding data of any type, such as integers, strings, Python objects etc.

2. [DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html#pandas.DataFrame): a two-dimensional data structure that holds data like a two-dimensional array or a table with rows and columns.
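
To make the relationship between the two classes concrete, here is a minimal sketch. The column names and values are made up for illustration. It shows that selecting a single column of a DataFrame gives back a Series that keeps the row labels.

```python
import pandas as pd

# Hypothetical example data: two labelled columns sharing one index.
frame = pd.DataFrame(
    {"height_cm": [170, 182, 165], "name": ["Ana", "Ben", "Cho"]},
    index=["a", "b", "c"],
)

# Selecting one column returns a Series that keeps the row labels.
heights = frame["height_cm"]

print(type(frame).__name__)    # DataFrame
print(type(heights).__name__)  # Series
print(frame.shape)             # (3, 2)
print(heights.shape)           # (3,)
```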

#### Object creation

See the [Intro to data structures](https://pandas.pydata.org/docs/user_guide/dsintro.html#dsintro) section.

Creating a [Series](https://pandas.pydata.org/docs/reference/api/pandas.Series.html#pandas.Series) by passing a list of values, letting pandas create a default [RangeIndex](https://pandas.pydata.org/docs/reference/api/pandas.RangeIndex.html#pandas.RangeIndex).

In [11]:
s = pd.Series([1, 3, 5, np.nan, 6, 8])

s

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

Creating a DataFrame by passing a NumPy array with a datetime index using `date_range()` and labeled columns:

In [12]:
dates = pd.date_range("20130101", periods=6)

dates

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [13]:
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))

df

Unnamed: 0,A,B,C,D
2013-01-01,-1.670804,0.827831,-0.507535,0.515955
2013-01-02,0.306263,-0.841184,0.417032,1.587285
2013-01-03,-0.944156,-0.465254,-0.239475,-0.853602
2013-01-04,1.048471,0.446681,-0.462223,0.472819
2013-01-05,0.853756,-1.653521,0.235151,-0.716782
2013-01-06,1.608412,0.422529,0.523581,1.567412


Creating a DataFrame by passing a dictionary of objects where the keys are the column labels and the values are the column values.

In [14]:
df2 = pd.DataFrame(
    {
        "A": 1.0,
        "B": pd.Timestamp("20130102"),
        "C": pd.Series(1, index=list(range(4)), dtype="float32"),
        "D": np.array([3] * 4, dtype="int32"),
        "E": pd.Categorical(["test", "train", "test", "train"]),
        "F": "foo",
    }
)

df2

Unnamed: 0,A,B,C,D,E,F
0,1.0,2013-01-02,1.0,3,test,foo
1,1.0,2013-01-02,1.0,3,train,foo
2,1.0,2013-01-02,1.0,3,test,foo
3,1.0,2013-01-02,1.0,3,train,foo


The columns of the resulting DataFrame have different `dtypes`:

In [15]:
df2.dtypes

A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object

#### Viewing data

See the [Essential basic functionality](https://pandas.pydata.org/docs/user_guide/basics.html#basics) section.

Use `DataFrame.head()` and `DataFrame.tail()` to view the top and bottom rows of the frame respectively:

In [16]:
df.head()

Unnamed: 0,A,B,C,D
2013-01-01,-1.670804,0.827831,-0.507535,0.515955
2013-01-02,0.306263,-0.841184,0.417032,1.587285
2013-01-03,-0.944156,-0.465254,-0.239475,-0.853602
2013-01-04,1.048471,0.446681,-0.462223,0.472819
2013-01-05,0.853756,-1.653521,0.235151,-0.716782


In [17]:
df.tail(3)

Unnamed: 0,A,B,C,D
2013-01-04,1.048471,0.446681,-0.462223,0.472819
2013-01-05,0.853756,-1.653521,0.235151,-0.716782
2013-01-06,1.608412,0.422529,0.523581,1.567412


Display the `DataFrame.index` or `DataFrame.columns`:

In [18]:
df.index

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [19]:
df.columns

Index(['A', 'B', 'C', 'D'], dtype='object')

Return a NumPy representation of the underlying data with `DataFrame.to_numpy()` without the index or column labels:

In [20]:
df.to_numpy()

array([[-1.67080443,  0.82783052, -0.50753532,  0.51595481],
       [ 0.30626281, -0.84118409,  0.41703166,  1.58728544],
       [-0.94415553, -0.46525418, -0.23947506, -0.85360175],
       [ 1.04847085,  0.44668129, -0.46222338,  0.47281859],
       [ 0.85375563, -1.65352147,  0.23515088, -0.71678194],
       [ 1.6084122 ,  0.42252871,  0.52358115,  1.56741205]])

**Note:** NumPy arrays have one dtype for the entire array while pandas DataFrames have one dtype per column. When you call DataFrame.to_numpy(), pandas will find the NumPy dtype that can hold all of the dtypes in the DataFrame. If the common data type is object, DataFrame.to_numpy() will require copying data.

In [21]:
df2.dtypes

A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object

In [22]:
df2.to_numpy()

array([[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo']],
      dtype=object)

`describe()` shows a quick statistical summary of your data:

In [23]:
df.describe()

Unnamed: 0,A,B,C,D
count,6.0,6.0,6.0,6.0
mean,0.200324,-0.210487,-0.005578,0.428848
std,1.261048,0.944004,0.454258,1.05877
min,-1.670804,-1.653521,-0.507535,-0.853602
25%,-0.631551,-0.747202,-0.406536,-0.419382
50%,0.580009,-0.021363,-0.002162,0.494387
75%,0.999792,0.440643,0.371561,1.304548
max,1.608412,0.827831,0.523581,1.587285


Transposing your data:

In [24]:
df.T

Unnamed: 0,2013-01-01,2013-01-02,2013-01-03,2013-01-04,2013-01-05,2013-01-06
A,-1.670804,0.306263,-0.944156,1.048471,0.853756,1.608412
B,0.827831,-0.841184,-0.465254,0.446681,-1.653521,0.422529
C,-0.507535,0.417032,-0.239475,-0.462223,0.235151,0.523581
D,0.515955,1.587285,-0.853602,0.472819,-0.716782,1.567412


DataFrame.sort_index() sorts by an axis:

In [25]:
df.sort_index(axis=1, ascending=False)

Unnamed: 0,D,C,B,A
2013-01-01,0.515955,-0.507535,0.827831,-1.670804
2013-01-02,1.587285,0.417032,-0.841184,0.306263
2013-01-03,-0.853602,-0.239475,-0.465254,-0.944156
2013-01-04,0.472819,-0.462223,0.446681,1.048471
2013-01-05,-0.716782,0.235151,-1.653521,0.853756
2013-01-06,1.567412,0.523581,0.422529,1.608412


DataFrame.sort_values() sorts by values:

In [26]:
df.sort_values(by="B")

Unnamed: 0,A,B,C,D
2013-01-05,0.853756,-1.653521,0.235151,-0.716782
2013-01-02,0.306263,-0.841184,0.417032,1.587285
2013-01-03,-0.944156,-0.465254,-0.239475,-0.853602
2013-01-06,1.608412,0.422529,0.523581,1.567412
2013-01-04,1.048471,0.446681,-0.462223,0.472819
2013-01-01,-1.670804,0.827831,-0.507535,0.515955


#### Selection

Note: While standard Python / NumPy expressions for selecting and setting are intuitive and come in handy for interactive work, for production code, we recommend the optimized pandas data access methods, `DataFrame.at()`, `DataFrame.iat()`, `DataFrame.loc()` and `DataFrame.iloc()`.

See the indexing documentation [Indexing and Selecting Data](https://pandas.pydata.org/docs/user_guide/indexing.html#indexing) and [MultiIndex / Advanced Indexing](https://pandas.pydata.org/docs/user_guide/advanced.html#advanced).


##### Getitem ([])

For a DataFrame, passing a single label selects a column and yields a Series equivalent to `df.A`:

In [27]:
df["A"]

2013-01-01   -1.670804
2013-01-02    0.306263
2013-01-03   -0.944156
2013-01-04    1.048471
2013-01-05    0.853756
2013-01-06    1.608412
Freq: D, Name: A, dtype: float64

For a DataFrame, passing a slice : selects matching rows:

In [28]:
df[0:3]

Unnamed: 0,A,B,C,D
2013-01-01,-1.670804,0.827831,-0.507535,0.515955
2013-01-02,0.306263,-0.841184,0.417032,1.587285
2013-01-03,-0.944156,-0.465254,-0.239475,-0.853602


##### Selection by label

See more in [Selection by Label](https://pandas.pydata.org/docs/user_guide/indexing.html#indexing-label) using `DataFrame.loc()` or `DataFrame.at()`.

Selecting a row matching a label:

In [29]:
df.loc[dates[0]]

A   -1.670804
B    0.827831
C   -0.507535
D    0.515955
Name: 2013-01-01 00:00:00, dtype: float64

Selecting all rows (`:`) with selected column labels:

In [30]:
df.loc[:, ["A", "B"]]

Unnamed: 0,A,B
2013-01-01,-1.670804,0.827831
2013-01-02,0.306263,-0.841184
2013-01-03,-0.944156,-0.465254
2013-01-04,1.048471,0.446681
2013-01-05,0.853756,-1.653521
2013-01-06,1.608412,0.422529


For label slicing, both endpoints are included:

In [31]:
df.loc["20130102":"20130104", ["A", "B"]]

Unnamed: 0,A,B
2013-01-02,0.306263,-0.841184
2013-01-03,-0.944156,-0.465254
2013-01-04,1.048471,0.446681
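
The inclusive end point is easy to miss, because position-based slices exclude their stop position. The sketch below uses a small made-up frame (the name `demo` and its values are only for illustration) to compare the two behaviours side by side.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a daily index, mirroring the shape used above.
idx = pd.date_range("20130101", periods=6)
demo = pd.DataFrame(np.arange(12).reshape(6, 2), index=idx, columns=["A", "B"])

# Label slice: both the start and the end label are returned.
by_label = demo.loc["20130102":"20130104"]

# Position slice: the stop position is excluded, as in Python lists.
by_position = demo.iloc[1:3]

print(len(by_label))     # 3 rows: 2 Jan, 3 Jan, 4 Jan
print(len(by_position))  # 2 rows: 2 Jan, 3 Jan
```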


Selecting a single row and column label returns a scalar:

In [32]:
df.loc[dates[0], "A"]

-1.6708044294824267

For getting fast access to a scalar (equivalent to the prior method):

In [33]:
df.at[dates[0], "A"]

-1.6708044294824267

##### Selection by position

See more in [Selection by Position](https://pandas.pydata.org/docs/user_guide/indexing.html#indexing-integer) using `DataFrame.iloc()` or `DataFrame.iat()`.

Select via the position of the passed integers:

In [34]:
df.iloc[3]

A    1.048471
B    0.446681
C   -0.462223
D    0.472819
Name: 2013-01-04 00:00:00, dtype: float64

Integer slices act similarly to NumPy/Python:

In [35]:
df.iloc[3:5, 0:2]

Unnamed: 0,A,B
2013-01-04,1.048471,0.446681
2013-01-05,0.853756,-1.653521


Lists of integer position locations:

In [36]:
df.iloc[[1, 2, 4], [0, 2]]

Unnamed: 0,A,C
2013-01-02,0.306263,0.417032
2013-01-03,-0.944156,-0.239475
2013-01-05,0.853756,0.235151


For slicing rows explicitly:

In [37]:
df.iloc[1:3, :]

Unnamed: 0,A,B,C,D
2013-01-02,0.306263,-0.841184,0.417032,1.587285
2013-01-03,-0.944156,-0.465254,-0.239475,-0.853602


For slicing columns explicitly:

In [38]:
df.iloc[:, 1:3]

Unnamed: 0,B,C
2013-01-01,0.827831,-0.507535
2013-01-02,-0.841184,0.417032
2013-01-03,-0.465254,-0.239475
2013-01-04,0.446681,-0.462223
2013-01-05,-1.653521,0.235151
2013-01-06,0.422529,0.523581


For getting a value explicitly:

In [39]:
df.iloc[1, 1]

-0.8411840889953582

For getting fast access to a scalar (equivalent to the prior method):

In [40]:
df.iat[1, 1]

-0.8411840889953582

##### Boolean indexing

Select rows where `df.A` is greater than `0`.

In [41]:
df[df["A"] > 0]

Unnamed: 0,A,B,C,D
2013-01-02,0.306263,-0.841184,0.417032,1.587285
2013-01-04,1.048471,0.446681,-0.462223,0.472819
2013-01-05,0.853756,-1.653521,0.235151,-0.716782
2013-01-06,1.608412,0.422529,0.523581,1.567412


Selecting values from a DataFrame where a boolean condition is met:

In [42]:
df[df > 0]

Unnamed: 0,A,B,C,D
2013-01-01,,0.827831,,0.515955
2013-01-02,0.306263,,0.417032,1.587285
2013-01-03,,,,
2013-01-04,1.048471,0.446681,,0.472819
2013-01-05,0.853756,,0.235151,
2013-01-06,1.608412,0.422529,0.523581,1.567412


Using `isin()` method for filtering:

In [43]:
df2 = df.copy()

df2["E"] = ["one", "one", "two", "three", "four", "three"]

df2

Unnamed: 0,A,B,C,D,E
2013-01-01,-1.670804,0.827831,-0.507535,0.515955,one
2013-01-02,0.306263,-0.841184,0.417032,1.587285,one
2013-01-03,-0.944156,-0.465254,-0.239475,-0.853602,two
2013-01-04,1.048471,0.446681,-0.462223,0.472819,three
2013-01-05,0.853756,-1.653521,0.235151,-0.716782,four
2013-01-06,1.608412,0.422529,0.523581,1.567412,three


In [44]:
df2[df2["E"].isin(["two", "four"])]

Unnamed: 0,A,B,C,D,E
2013-01-03,-0.944156,-0.465254,-0.239475,-0.853602,two
2013-01-05,0.853756,-1.653521,0.235151,-0.716782,four


##### Setting

Setting a new column automatically aligns the data by the indexes:

In [45]:
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range("20130102", periods=6))

s1

2013-01-02    1
2013-01-03    2
2013-01-04    3
2013-01-05    4
2013-01-06    5
2013-01-07    6
Freq: D, dtype: int64

In [46]:
df["F"] = s1

Setting values by label:

In [47]:
df.at[dates[0], "A"] = 0

Setting values by position:

In [48]:
df.iat[0, 1] = 0

Setting by assigning with a NumPy array:

In [49]:
df.loc[:, "D"] = np.array([5] * len(df))

The result of the prior setting operations:

In [50]:
df

Unnamed: 0,A,B,C,D,F
2013-01-01,0.0,0.0,-0.507535,5.0,
2013-01-02,0.306263,-0.841184,0.417032,5.0,1.0
2013-01-03,-0.944156,-0.465254,-0.239475,5.0,2.0
2013-01-04,1.048471,0.446681,-0.462223,5.0,3.0
2013-01-05,0.853756,-1.653521,0.235151,5.0,4.0
2013-01-06,1.608412,0.422529,0.523581,5.0,5.0


A `where` operation with setting:

In [51]:
df2 = df.copy()

df2[df2 > 0] = -df2

df2

Unnamed: 0,A,B,C,D,F
2013-01-01,0.0,0.0,-0.507535,-5.0,
2013-01-02,-0.306263,-0.841184,-0.417032,-5.0,-1.0
2013-01-03,-0.944156,-0.465254,-0.239475,-5.0,-2.0
2013-01-04,-1.048471,-0.446681,-0.462223,-5.0,-3.0
2013-01-05,-0.853756,-1.653521,-0.235151,-5.0,-4.0
2013-01-06,-1.608412,-0.422529,-0.523581,-5.0,-5.0


#### Missing data

For NumPy data types, `np.nan` represents missing data. It is by default not included in computations. See the [Missing Data](https://pandas.pydata.org/docs/user_guide/missing_data.html#missing-data) section.
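
As a quick illustration of "not included in computations", the sketch below (with made-up values) compares the default behaviour of `mean()` with `skipna=False`.

```python
import numpy as np
import pandas as pd

values = pd.Series([1.0, np.nan, 3.0])

# By default NaN is skipped, so the mean is taken over 1.0 and 3.0.
mean_skipping_nan = values.mean()

# With skipna=False the missing value propagates into the result.
mean_with_nan = values.mean(skipna=False)

print(mean_skipping_nan)  # 2.0
print(mean_with_nan)      # nan
print(values.count())     # 2, count() ignores NaN as well
```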

Reindexing allows you to change/add/delete the index on a specified axis. This returns a copy of the data:

In [52]:
df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ["E"])

df1.loc[dates[0] : dates[1], "E"] = 1

df1

Unnamed: 0,A,B,C,D,F,E
2013-01-01,0.0,0.0,-0.507535,5.0,,1.0
2013-01-02,0.306263,-0.841184,0.417032,5.0,1.0,1.0
2013-01-03,-0.944156,-0.465254,-0.239475,5.0,2.0,
2013-01-04,1.048471,0.446681,-0.462223,5.0,3.0,


`DataFrame.dropna()` drops any rows that have missing data:

In [53]:
df1.dropna(how="any")

Unnamed: 0,A,B,C,D,F,E
2013-01-02,0.306263,-0.841184,0.417032,5.0,1.0,1.0


`DataFrame.fillna()` fills missing data:

In [54]:
df1.fillna(value=5)

Unnamed: 0,A,B,C,D,F,E
2013-01-01,0.0,0.0,-0.507535,5.0,5.0,1.0
2013-01-02,0.306263,-0.841184,0.417032,5.0,1.0,1.0
2013-01-03,-0.944156,-0.465254,-0.239475,5.0,2.0,5.0
2013-01-04,1.048471,0.446681,-0.462223,5.0,3.0,5.0


`isna()` gets the boolean mask where values are `nan`:

In [55]:
pd.isna(df1)

Unnamed: 0,A,B,C,D,F,E
2013-01-01,False,False,False,False,True,False
2013-01-02,False,False,False,False,False,False
2013-01-03,False,False,False,False,False,True
2013-01-04,False,False,False,False,False,True


#### Operations

See the [Basic section on Binary Ops](https://pandas.pydata.org/docs/user_guide/basics.html#basics-binop).


##### Stats

Operations in general exclude missing data.

Calculate the mean value for each column:

In [56]:
df.mean()

A    0.478791
B   -0.348458
C   -0.005578
D    5.000000
F    3.000000
dtype: float64

Calculate the mean value for each row:

In [57]:
df.mean(axis=1)

2013-01-01    1.123116
2013-01-02    1.176422
2013-01-03    1.070223
2013-01-04    1.806586
2013-01-05    1.687077
2013-01-06    2.510904
Freq: D, dtype: float64

Operating with another Series or DataFrame with a different index or column will align the result with the union of the index or column labels. In addition, pandas automatically broadcasts along the specified dimension and will fill unaligned labels with `np.nan`.

In [58]:
s = pd.Series([1, 3, 5, np.nan, 6, 8], index=dates).shift(2)

s

2013-01-01    NaN
2013-01-02    NaN
2013-01-03    1.0
2013-01-04    3.0
2013-01-05    5.0
2013-01-06    NaN
Freq: D, dtype: float64

In [59]:
df.sub(s, axis="index")

Unnamed: 0,A,B,C,D,F
2013-01-01,,,,,
2013-01-02,,,,,
2013-01-03,-1.944156,-1.465254,-1.239475,4.0,1.0
2013-01-04,-1.951529,-2.553319,-3.462223,2.0,0.0
2013-01-05,-4.146244,-6.653521,-4.764849,0.0,-1.0
2013-01-06,,,,,
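
The alignment rule is perhaps easier to see with two small Series whose indexes only partly overlap. This is a sketch with made-up labels and values. It also shows `Series.add()` with `fill_value`, which is one way to avoid the `NaN` results.

```python
import pandas as pd

a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20, 30], index=["y", "z", "w"])

# The result is indexed by the union of both indexes.
# Labels present on only one side ("x" and "w") become NaN.
total = a + b

print(total)

# add() with fill_value treats a label missing on one side as 0 instead.
filled = a.add(b, fill_value=0)

print(filled)
```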


##### User defined functions

`DataFrame.agg()` and `DataFrame.transform()` apply a user defined function that reduces or broadcasts its result respectively.

In [60]:
df.agg(lambda x: np.mean(x) * 5.6)

A     2.681230
B    -1.951366
C    -0.031239
D    28.000000
F    16.800000
dtype: float64

In [61]:
df.transform(lambda x: x * 101.2)

Unnamed: 0,A,B,C,D,F
2013-01-01,0.0,0.0,-51.362574,506.0,
2013-01-02,30.993796,-85.12783,42.203604,506.0,101.2
2013-01-03,-95.548539,-47.083723,-24.234876,506.0,202.4
2013-01-04,106.10525,45.204146,-46.777006,506.0,303.6
2013-01-05,86.400069,-167.336372,23.797269,506.0,404.8
2013-01-06,162.771315,42.759906,52.986412,506.0,506.0


##### Value Counts

See more at [Histogramming and Discretization](https://pandas.pydata.org/docs/user_guide/basics.html#basics-discretization).

In [62]:
s = pd.Series(np.random.randint(0, 7, size=10))

s

0    3
1    5
2    5
3    2
4    6
5    2
6    3
7    0
8    6
9    0
dtype: int32

In [63]:
s.value_counts()

3    2
5    2
2    2
6    2
0    2
Name: count, dtype: int64

##### String Methods

Series is equipped with a set of string processing methods in the str attribute that make it easy to operate on each element of the array, as in the code snippet below. See more at Vectorized String Methods.

In [64]:
s = pd.Series(["A", "B", "C", "Aaba", "Baca", np.nan, "CABA", "dog", "cat"])

s.str.lower()

0       a
1       b
2       c
3    aaba
4    baca
5     NaN
6    caba
7     dog
8     cat
dtype: object

#### Merge
##### Concat

pandas provides various facilities for easily combining together Series and DataFrame objects with various kinds of set logic for the indexes and relational algebra functionality in the case of join / merge-type operations.

See the [Merging](https://pandas.pydata.org/docs/user_guide/merging.html#merging) section.

Concatenating pandas objects together row-wise with `concat()`:

In [65]:
df = pd.DataFrame(np.random.randn(10, 4))

df

Unnamed: 0,0,1,2,3
0,-0.225088,1.415321,0.074033,-0.843676
1,-0.838275,0.308507,0.38842,1.754373
2,-0.470032,-1.220458,0.380402,-1.596054
3,0.291188,0.742329,0.832999,1.197853
4,0.118802,-1.497275,-0.604917,1.465306
5,-0.168138,0.330202,-2.166223,0.490461
6,0.195747,0.244266,1.696696,1.068838
7,1.0769,-0.515225,1.194313,-0.623154
8,0.166628,2.261204,-1.411812,1.707294
9,0.524385,1.515111,-0.88965,-3.434654


In [66]:
# break it into pieces

pieces = [df[:3], df[3:7], df[7:]]

pd.concat(pieces)

Unnamed: 0,0,1,2,3
0,-0.225088,1.415321,0.074033,-0.843676
1,-0.838275,0.308507,0.38842,1.754373
2,-0.470032,-1.220458,0.380402,-1.596054
3,0.291188,0.742329,0.832999,1.197853
4,0.118802,-1.497275,-0.604917,1.465306
5,-0.168138,0.330202,-2.166223,0.490461
6,0.195747,0.244266,1.696696,1.068838
7,1.0769,-0.515225,1.194313,-0.623154
8,0.166628,2.261204,-1.411812,1.707294
9,0.524385,1.515111,-0.88965,-3.434654


Note: Adding a column to a DataFrame is relatively fast. However, adding a row requires a copy, and may be expensive. We recommend passing a pre-built list of records to the DataFrame constructor instead of building a DataFrame by iteratively appending records to it.
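
A minimal sketch of the recommended pattern follows. The record fields (`id`, `squared`) are hypothetical and stand in for whatever you collect in your own loop.

```python
import pandas as pd

# Hypothetical records, e.g. collected one at a time inside a loop.
records = []
for i in range(5):
    records.append({"id": i, "squared": i ** 2})

# Build the DataFrame once, after all records have been collected.
df_records = pd.DataFrame(records)

print(df_records.shape)  # (5, 2)
```

Appending to a Python list is cheap, so the only DataFrame construction happens once at the end.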

##### Join

`merge()` enables SQL style join types along specific columns. See the [Database style joining](https://pandas.pydata.org/docs/user_guide/merging.html#merging-join) section.

In [67]:
left = pd.DataFrame({"key": ["foo", "foo"], "lval": [1, 2]})

right = pd.DataFrame({"key": ["foo", "foo"], "rval": [4, 5]})

left

Unnamed: 0,key,lval
0,foo,1
1,foo,2


In [68]:
pd.merge(left, right, on="key")

Unnamed: 0,key,lval,rval
0,foo,1,4
1,foo,1,5
2,foo,2,4
3,foo,2,5


`merge()` on unique keys:

In [69]:
left = pd.DataFrame({"key": ["foo", "bar"], "lval": [1, 2]})

right = pd.DataFrame({"key": ["foo", "bar"], "rval": [4, 5]})

left

Unnamed: 0,key,lval
0,foo,1
1,bar,2


In [70]:
right

Unnamed: 0,key,rval
0,foo,4
1,bar,5


In [71]:
pd.merge(left, right, on="key")

Unnamed: 0,key,lval,rval
0,foo,1,4
1,bar,2,5
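
The examples above use the default inner join. As a sketch of the other SQL style join types, the code below uses two made-up frames (named `left_demo` and `right_demo` so they do not overwrite the variables above) and the `how` and `indicator` parameters of `merge()`.

```python
import pandas as pd

left_demo = pd.DataFrame({"key": ["foo", "bar", "baz"], "lval": [1, 2, 3]})
right_demo = pd.DataFrame({"key": ["foo", "bar", "qux"], "rval": [4, 5, 6]})

# Inner join (the default): only keys present in both frames.
inner = pd.merge(left_demo, right_demo, on="key")

# Left join: every row of the left frame; rval is NaN where there is no match.
left_join = pd.merge(left_demo, right_demo, on="key", how="left")

# Outer join: all keys from both frames; indicator=True adds a `_merge`
# column saying which side each row came from.
outer = pd.merge(left_demo, right_demo, on="key", how="outer", indicator=True)

print(len(inner), len(left_join), len(outer))  # 2 3 4
```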


#### Grouping

By “group by” we are referring to a process involving one or more of the following steps:
- **Splitting** the data into groups based on some criteria
- **Applying** a function to each group independently
- **Combining** the results into a data structure

See the [Grouping](https://pandas.pydata.org/docs/user_guide/groupby.html#groupby) section.

In [72]:
df = pd.DataFrame(
    {
        "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
        "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
        "C": np.random.randn(8),
        "D": np.random.randn(8),
    }
)

df

Unnamed: 0,A,B,C,D
0,foo,one,0.255474,-0.045628
1,bar,one,0.959611,1.4479
2,foo,two,1.319942,0.627757
3,bar,three,1.562338,0.337114
4,foo,two,-0.370815,-0.363726
5,bar,two,-1.524314,1.459675
6,foo,one,1.369581,1.216648
7,foo,three,-2.148407,2.013442


Grouping by a column label, selecting column labels, and then applying the `DataFrameGroupBy.sum()` function to the resulting groups:

In [73]:
df.groupby("A")[["C", "D"]].sum()

Unnamed: 0_level_0,C,D
A,Unnamed: 1_level_1,Unnamed: 2_level_1
bar,0.997636,3.244688
foo,0.425776,3.448492


Grouping by multiple column labels forms a `MultiIndex`.

In [74]:
df.groupby(["A", "B"]).sum()

Unnamed: 0_level_0,Unnamed: 1_level_0,C,D
A,B,Unnamed: 2_level_1,Unnamed: 3_level_1
bar,one,0.959611,1.4479
bar,three,1.562338,0.337114
bar,two,-1.524314,1.459675
foo,one,1.625056,1.171019
foo,three,-2.148407,2.013442
foo,two,0.949127,0.264031
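
To connect these calls back to the splitting, applying and combining steps, here is a sketch that performs the three steps by hand on a small made-up frame (`sales` and its columns are hypothetical) and compares the result with the one-line `groupby` version.

```python
import pandas as pd

sales = pd.DataFrame(
    {"shop": ["north", "south", "north", "south"], "amount": [10, 20, 30, 40]}
)

# Split: iterating over a groupby yields (key, sub-frame) pairs.
# Apply: compute one number per sub-frame.
totals = {}
for shop, group in sales.groupby("shop"):
    totals[shop] = group["amount"].sum()

# Combine: collect the per-group results into one Series.
by_hand = pd.Series(totals, name="amount")

# The one-line equivalent.
built_in = sales.groupby("shop")["amount"].sum()

print(int(by_hand["north"]), int(by_hand["south"]))    # 40 60
print(int(built_in["north"]), int(built_in["south"]))  # 40 60
```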


#### Reshaping

See the sections on Hierarchical Indexing and Reshaping.

##### Stack

In [75]:
arrays = [
   ["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
   ["one", "two", "one", "two", "one", "two", "one", "two"],
]

index = pd.MultiIndex.from_arrays(arrays, names=["first", "second"])

df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=["A", "B"])

df2 = df[:4]

df2

Unnamed: 0_level_0,Unnamed: 1_level_0,A,B
first,second,Unnamed: 2_level_1,Unnamed: 3_level_1
bar,one,2.759498,1.069632
bar,two,0.871124,-0.391132
baz,one,1.396013,-0.205075
baz,two,0.349268,0.127466


The `stack()` method “compresses” a level in the DataFrame’s columns:

In [76]:
# future_stack=True is only accepted by pandas 2.1 and later;
# plain stack() gives the same result for this frame
stacked = df2.stack()

stacked

With a “stacked” DataFrame or Series (having a `MultiIndex` as the `index`), the inverse operation of `stack()` is `unstack()`, which by default unstacks the last level:

In [77]:
stacked.unstack()

In [78]:
stacked.unstack(1)

In [79]:
stacked.unstack(0)
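
For a self-contained view of the round trip, the sketch below uses a tiny made-up frame (`wide`, with fixed integer values) so that the effect of `stack()` and `unstack()` is easy to follow. It calls `stack()` without extra arguments, which is assumed to work on the pandas version you have installed.

```python
import pandas as pd

# Hypothetical 2x2 frame with fixed values so the result is easy to follow.
wide = pd.DataFrame(
    {"A": [1, 2], "B": [3, 4]},
    index=pd.Index(["bar", "baz"], name="first"),
)

# stack() moves the column labels into a new innermost index level,
# turning the 2x2 frame into a Series of length 4.
stacked_demo = wide.stack()

# unstack() moves the innermost index level back into columns.
round_trip = stacked_demo.unstack()

print(len(stacked_demo))               # 4
print(stacked_demo.loc[("bar", "B")])  # 3
print(round_trip.shape)                # (2, 2)
```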

##### Pivot tables

See the section on [Pivot Tables](https://pandas.pydata.org/docs/user_guide/reshaping.html#reshaping-pivot).

In [80]:
df = pd.DataFrame(
    {
        "A": ["one", "one", "two", "three"] * 3,
        "B": ["A", "B", "C"] * 4,
        "C": ["foo", "foo", "foo", "bar", "bar", "bar"] * 2,
        "D": np.random.randn(12),
        "E": np.random.randn(12),
    }
)

df

Unnamed: 0,A,B,C,D,E
0,one,A,foo,-1.335344,0.484721
1,one,B,foo,-0.606958,-0.490103
2,two,C,foo,0.686005,-0.043847
3,three,A,bar,2.602712,-1.002015
4,one,B,bar,-0.209136,-0.541507
5,one,C,bar,0.216401,-0.062737
6,two,A,foo,0.421082,-0.934131
7,three,B,foo,0.032949,0.643853
8,one,C,foo,2.432981,-0.870689
9,one,A,bar,-0.327014,1.243758


`pivot_table()` pivots a `DataFrame` specifying the `values`, `index` and `columns`:

In [81]:
pd.pivot_table(df, values="D", index=["A", "B"], columns=["C"])

Unnamed: 0_level_0,C,bar,foo
A,B,Unnamed: 2_level_1,Unnamed: 3_level_1
one,A,-0.327014,-1.335344
one,B,-0.209136,-0.606958
one,C,0.216401,2.432981
three,A,2.602712,
three,B,,0.032949
three,C,0.845979,
two,A,,0.421082
two,B,1.004911,
two,C,,0.686005
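
When several rows fall into the same cell, `pivot_table()` aggregates them, using the mean by default. The sketch below uses a made-up `scores` frame to show the default and the `aggfunc` and `fill_value` parameters.

```python
import pandas as pd

scores = pd.DataFrame(
    {
        "team": ["red", "red", "red", "blue"],
        "round": ["r1", "r1", "r2", "r1"],
        "points": [1, 3, 5, 7],
    }
)

# Default aggregation is the mean: the two red/r1 rows collapse to 2.0.
mean_table = pd.pivot_table(scores, values="points", index="team", columns="round")

# aggfunc chooses a different reduction; fill_value replaces empty cells.
sum_table = pd.pivot_table(
    scores, values="points", index="team", columns="round",
    aggfunc="sum", fill_value=0,
)

print(mean_table.loc["red", "r1"])  # 2.0
print(sum_table.loc["red", "r1"])   # 4
print(sum_table.loc["blue", "r2"])  # 0
```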


#### Time series

pandas has simple, powerful, and efficient functionality for performing resampling operations during frequency conversion (e.g., converting secondly data into 5-minutely data). This is extremely common in, but not limited to, financial applications. See the Time Series section.

In [82]:
rng = pd.date_range("1/1/2012", periods=100, freq="s")

ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)

ts.resample("5Min").sum()

2012-01-01    23300
Freq: 5T, dtype: int32

`Series.tz_localize()` localizes a time series to a time zone:

In [83]:
rng = pd.date_range("3/6/2012 00:00", periods=5, freq="D")

ts = pd.Series(np.random.randn(len(rng)), rng)

ts

2012-03-06   -0.714988
2012-03-07   -0.085584
2012-03-08   -1.795465
2012-03-09   -1.088295
2012-03-10    0.890735
Freq: D, dtype: float64

In [84]:
ts_utc = ts.tz_localize("UTC")

ts_utc

2012-03-06 00:00:00+00:00   -0.714988
2012-03-07 00:00:00+00:00   -0.085584
2012-03-08 00:00:00+00:00   -1.795465
2012-03-09 00:00:00+00:00   -1.088295
2012-03-10 00:00:00+00:00    0.890735
Freq: D, dtype: float64

`Series.tz_convert()` converts a timezone-aware time series to another time zone:

In [85]:
ts_utc.tz_convert("US/Eastern")

2012-03-05 19:00:00-05:00   -0.714988
2012-03-06 19:00:00-05:00   -0.085584
2012-03-07 19:00:00-05:00   -1.795465
2012-03-08 19:00:00-05:00   -1.088295
2012-03-09 19:00:00-05:00    0.890735
Freq: D, dtype: float64

Adding a non-fixed duration (BusinessDay) to a time series:

In [86]:
rng

DatetimeIndex(['2012-03-06', '2012-03-07', '2012-03-08', '2012-03-09',
               '2012-03-10'],
              dtype='datetime64[ns]', freq='D')

In [87]:
rng + pd.offsets.BusinessDay(5)

DatetimeIndex(['2012-03-13', '2012-03-14', '2012-03-15', '2012-03-16',
               '2012-03-16'],
              dtype='datetime64[ns]', freq=None)

#### Categoricals

pandas can include categorical data in a DataFrame. For full docs, see the categorical introduction and the API documentation.

In [88]:
df = pd.DataFrame(
    {"id": [1, 2, 3, 4, 5, 6], "raw_grade": ["a", "b", "b", "a", "a", "e"]}
)

Converting the raw grades to a categorical data type:

In [89]:
df["grade"] = df["raw_grade"].astype("category")

df["grade"]

0    a
1    b
2    b
3    a
4    a
5    e
Name: grade, dtype: category
Categories (3, object): ['a', 'b', 'e']

Rename the categories to more meaningful names:

In [90]:
new_categories = ["very good", "good", "very bad"]

df["grade"] = df["grade"].cat.rename_categories(new_categories)

Reorder the categories and simultaneously add the missing categories (methods under `Series.cat()` return a new Series by default):

In [91]:
df["grade"] = df["grade"].cat.set_categories(
    ["very bad", "bad", "medium", "good", "very good"]
)

df["grade"]

0    very good
1         good
2         good
3    very good
4    very good
5     very bad
Name: grade, dtype: category
Categories (5, object): ['very bad', 'bad', 'medium', 'good', 'very good']

Sorting is per order in the categories, not lexical order:

In [92]:
df.sort_values(by="grade")

Unnamed: 0,id,raw_grade,grade
5,6,e,very bad
1,2,b,good
2,3,b,good
0,1,a,very good
3,4,a,very good
4,5,a,very good


Grouping by a categorical column with `observed=False` also shows empty categories:

In [93]:
df.groupby("grade", observed=False).size()

grade
very bad     1
bad          0
medium       0
good         2
very good    3
dtype: int64

#### Importing and exporting data

See the [IO Tools](https://pandas.pydata.org/docs/user_guide/io.html#io) section.

##### CSV

[Writing to a csv file:](https://pandas.pydata.org/docs/user_guide/io.html#io-store-in-csv) using `DataFrame.to_csv()`

In [94]:
df = pd.DataFrame(np.random.randint(0, 5, (10, 5)))

df.to_csv("foo.csv")

[Reading from a csv file:](https://pandas.pydata.org/docs/user_guide/io.html#io-read-csv-table) using `read_csv()`

In [95]:
pd.read_csv("foo.csv")

Unnamed: 0.1,Unnamed: 0,0,1,2,3,4
0,0,3,2,3,1,3
1,1,3,3,2,0,2
2,2,0,4,4,4,1
3,3,4,1,2,2,4
4,4,1,2,2,0,1
5,5,2,3,2,3,2
6,6,0,2,4,0,1
7,7,4,0,2,1,3
8,8,4,0,4,3,1
9,9,4,3,1,0,1
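
The extra `Unnamed: 0` column in the output above is the index that `to_csv()` wrote to the file, read back as an ordinary column. Two common ways to avoid it are sketched below. The example writes to an in-memory string rather than a file, and the frame `small` is made up for illustration.

```python
import io

import pandas as pd

small = pd.DataFrame({"x": [1, 2], "y": [3, 4]})

# Option 1: do not write the index at all.
without_index = small.to_csv(index=False)
reloaded_a = pd.read_csv(io.StringIO(without_index))

# Option 2: write the index, then tell read_csv that column 0 is the index.
with_index = small.to_csv()
reloaded_b = pd.read_csv(io.StringIO(with_index), index_col=0)

print(list(reloaded_a.columns))  # ['x', 'y']
print(list(reloaded_b.columns))  # ['x', 'y']
```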


### Seaborn

Seaborn is a library for making statistical graphics in Python. It builds on top of matplotlib and integrates closely with pandas data structures.

Seaborn helps you explore and understand your data. Its plotting functions operate on dataframes and arrays containing whole datasets and internally perform the necessary semantic mapping and statistical aggregation to produce informative plots. Its dataset-oriented, declarative API lets you focus on what the different elements of your plots mean, rather than on the details of how to draw them.

Let's get started with some basic functionality!



By convention, Seaborn is imported with the shorthand ```sns```.

In [96]:
import seaborn as sns

Next, we will set the theme of our plots to the default option. This uses the matplotlib rcParam system and will affect how all matplotlib plots look, even if you don’t make them with seaborn.

Beyond the default theme, there are [several other options](https://seaborn.pydata.org/tutorial/aesthetics.html), and you can independently control the style and scaling of the plot to quickly translate your work between presentation contexts (e.g., making a version of your figure that will have readable fonts when projected during a talk). 

In [97]:
sns.set_theme()

Note that sometimes in Jupyter, plots won't render after running the code more than once. The following code helps with this, making sure that they always show.

In [98]:
# Make sure that our plots are visualised in jupyter
%matplotlib inline

Seaborn comes with several useful built-in datasets that we can access and use. Let's get started by loading one of these datasets: The ```tips``` dataset.

In [99]:
# Load an example dataset
tips = sns.load_dataset("tips")

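**Note:** `load_dataset()` downloads the data from seaborn's online example-data repository, so the cell above fails with a `URLError` if there is no internet connection.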

We can then use this dataset to create a visualisation.

In [None]:
# Create a visualization
sns.relplot(
    data=tips,
    x="total_bill", y="tip", col="time",
    hue="smoker", style="smoker", size="size"
)

This plot shows the relationship between five variables in the tips dataset using a single call to the seaborn function relplot(). Notice how we provided only the names of the variables and their roles in the plot. Unlike when using matplotlib directly, it wasn’t necessary to specify attributes of the plot elements in terms of the color values or marker codes. Behind the scenes, seaborn handled the translation from values in the dataframe to arguments that matplotlib understands. This declarative approach lets you stay focused on the questions that you want to answer, rather than on the details of how to control matplotlib.
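
The same declarative style works with any pandas DataFrame, which is handy when the built-in datasets cannot be downloaded. The sketch below uses a small hand-made frame: `meals` and its values are invented and are not taken from the `tips` dataset. It also selects a non-interactive matplotlib backend so that it can run as a plain script.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend for scripts; omit inside Jupyter

import pandas as pd
import seaborn as sns

# Hypothetical stand-in data; the values are made up, not taken from `tips`.
meals = pd.DataFrame(
    {
        "total_bill": [10.5, 23.0, 15.2, 30.1, 18.4, 27.9],
        "tip": [1.5, 3.5, 2.0, 5.0, 3.0, 4.0],
        "time": ["Lunch", "Dinner", "Lunch", "Dinner", "Lunch", "Dinner"],
        "smoker": ["No", "Yes", "No", "No", "Yes", "Yes"],
    }
)

# Same declarative call as above: column names are mapped to plot roles.
grid = sns.relplot(data=meals, x="total_bill", y="tip", col="time", hue="smoker")

print(grid.axes.shape)  # (1, 2): one panel per value of `time`
```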

#### A high-level API for statistical graphics

There is no universally best way to visualize data. Different questions are best answered by different plots. Seaborn makes it easy to switch between different visual representations by using a consistent dataset-oriented API.

The function ```relplot()``` is named that way because it is designed to visualize many different statistical relationships. While scatter plots are often effective, relationships where one variable represents a measure of time are better represented by a line. The ```relplot()``` function has a convenient ```kind``` parameter that lets you easily switch to this alternate representation:

In [None]:
dots = sns.load_dataset("dots")
sns.relplot(
    data=dots, x="time", y="firing_rate", 
    kind="line", col="align",
    hue="choice", size="coherence", style="choice",
    facet_kws=dict(sharex=False)
)

Notice how the ```size``` and ```style``` parameters are used in both the scatter and line plots, but they affect the two visualizations differently: changing the marker area and symbol in the scatter plot vs the line width and dashing in the line plot. We did not need to keep those details in mind, letting us focus on the overall structure of the plot and the information we want it to convey.

#### Statistical estimation

Often, we are interested in the average value of one variable as a function of other variables. Many seaborn functions will automatically perform the statistical estimation that is necessary to answer these questions:

In [None]:
fmri = sns.load_dataset("fmri")
sns.relplot(
    data=fmri, kind="line",
    x="timepoint", y="signal", col="region",
    hue="event", style="event"
)

When statistical values are estimated, seaborn will use bootstrapping to compute confidence intervals and draw error bars representing the uncertainty of the estimate.
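
To make the idea of a bootstrapped confidence interval concrete, here is a minimal sketch using only NumPy. It is not seaborn's internal code: the sample is made up, and the choices of 1000 resamples and a 95% percentile interval are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up sample standing in for repeated measurements at one x value
sample = rng.normal(loc=10.0, scale=2.0, size=50)

n_boot = 1000  # number of bootstrap resamples (an illustrative choice)

# Resample with replacement and record the mean of each resample
boot_means = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(n_boot)
])

# 95% percentile interval of the bootstrap distribution of the mean
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
print(f"mean = {sample.mean():.2f}, 95% CI = ({ci_low:.2f}, {ci_high:.2f})")
```

The width of the interval reflects how much the estimated mean would vary if the data were collected again, which is what the shaded bands in the line plot convey.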

Statistical estimation in seaborn goes beyond descriptive statistics. For example, it is possible to enhance a scatterplot by including a linear regression model (and its uncertainty) using ```lmplot()```:

In [None]:
sns.lmplot(data=tips, x="total_bill", y="tip", col="time", hue="smoker")

#### Distributional representations

Statistical analyses require knowledge about the distribution of variables in your dataset. The seaborn function ```displot()``` supports several approaches to visualizing distributions. These include classic techniques like histograms and computationally-intensive approaches like kernel density estimation:

In [None]:
sns.displot(data=tips, x="total_bill", col="time", kde=True)

Seaborn also tries to promote techniques that are powerful but less familiar, such as calculating and plotting the empirical cumulative distribution function of the data:

In [None]:
sns.displot(data=tips, kind="ecdf", x="total_bill", col="time", hue="smoker", rug=True)
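
The empirical cumulative distribution function can also be computed by hand, which may help when reading the plot above. The sketch below uses made-up bill values and a hypothetical helper named `ecdf`; it only illustrates the definition and is not how seaborn implements it.

```python
import numpy as np

def ecdf(values):
    """Return the sorted values and, for each, the fraction of observations <= it."""
    x = np.sort(np.asarray(values, dtype=float))
    y = np.arange(1, x.size + 1) / x.size
    return x, y

bills = [10.5, 23.0, 16.2, 8.9, 31.4]  # made-up values for illustration
x, y = ecdf(bills)
print(x)
print(y)
```

Each step in an ECDF plot rises by `1 / n` at an observed value, so the curve reaches 1 at the largest observation.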

#### Plots for categorical data

Several specialized plot types in seaborn are oriented towards visualizing categorical data. They can be accessed through ```catplot()```. These plots offer different levels of granularity. At the finest level, you may wish to see every observation by drawing a “swarm” plot: a scatter plot that adjusts the positions of the points along the categorical axis so that they don’t overlap:

In [None]:
sns.catplot(data=tips, kind="swarm", x="day", y="total_bill", hue="smoker")

Alternatively, you could use kernel density estimation to represent the underlying distribution that the points are sampled from:

In [100]:
sns.catplot(data=tips, kind="violin", x="day", y="total_bill", hue="smoker", split=True)


Or you could show only the mean value and its confidence interval within each nested category:

In [101]:
sns.catplot(data=tips, kind="bar", x="day", y="total_bill", hue="smoker")


#### Multivariate views on complex datasets

Some seaborn functions combine multiple kinds of plots to quickly give informative summaries of a dataset. One, ```jointplot()```, focuses on a single relationship. It plots the joint distribution between two variables along with each variable’s marginal distribution:

In [102]:
penguins = sns.load_dataset("penguins")
sns.jointplot(data=penguins, x="flipper_length_mm", y="bill_length_mm", hue="species")


The other, ```pairplot()```, takes a broader view: it shows joint and marginal distributions for all pairwise relationships and for each variable, respectively:

In [103]:
sns.pairplot(data=penguins, hue="species")


#### Lower-level tools for building figures

These tools work by combining [axes-level](https://seaborn.pydata.org/tutorial/function_overview.html) plotting functions with objects that manage the layout of the figure, linking the structure of a dataset to a [grid of axes](https://seaborn.pydata.org/tutorial/axis_grids.html). Both elements are part of the public API, and you can use them directly to create complex figures with only a few more lines of code:

In [104]:
g = sns.PairGrid(penguins, hue="species", corner=True)
g.map_lower(sns.kdeplot, hue=None, levels=5, color=".2")
g.map_lower(sns.scatterplot, marker="+")
g.map_diag(sns.histplot, element="step", linewidth=0, kde=True)
g.add_legend(frameon=True)
g.legend.set_bbox_to_anchor((.61, .6))


#### Opinionated defaults and flexible customization

Seaborn creates complete graphics with a single function call: when possible, its functions will automatically add informative axis labels and legends that explain the semantic mappings in the plot.

In many cases, seaborn will also choose default values for its parameters based on characteristics of the data. For example, the [color mappings](https://seaborn.pydata.org/tutorial/color_palettes.html) that we have seen so far used distinct hues (blue, orange, and sometimes green) to represent different levels of the categorical variables assigned to ```hue```. When mapping a numeric variable, some functions will switch to a continuous gradient:

In [105]:
sns.relplot(
    data=penguins,
    x="bill_length_mm", y="bill_depth_mm", hue="body_mass_g"
)



When you’re ready to share or publish your work, you’ll probably want to polish the figure beyond what the defaults achieve. Seaborn allows for several levels of customization. It defines multiple built-in [themes](https://seaborn.pydata.org/tutorial/aesthetics.html) that apply to all figures, its functions have standardized parameters that can modify the semantic mappings for each plot, and additional keyword arguments are passed down to the underlying matplotlib artists, allowing even more control. Once you’ve created a plot, its properties can be modified through both the seaborn API and by dropping down to the matplotlib layer for fine-grained tweaking:

In [106]:
sns.set_theme(style="ticks", font_scale=1.25)
g = sns.relplot(
    data=penguins,
    x="bill_length_mm", y="bill_depth_mm", hue="body_mass_g",
    palette="crest", marker="x", s=100,
)
g.set_axis_labels("Bill length (mm)", "Bill depth (mm)", labelpad=10)
g.legend.set_title("Body mass (g)")
g.figure.set_size_inches(6.5, 4.5)
g.ax.margins(.15)
g.despine(trim=True)


#### Relationship to matplotlib

Seaborn’s integration with matplotlib allows you to use it across the many environments that matplotlib supports, including exploratory analysis in notebooks, real-time interaction in GUI applications, and archival output in a number of raster and vector formats.

While you can be productive using only seaborn functions, full customization of your graphics will require some knowledge of matplotlib’s concepts and API. One aspect of the learning curve for new users of seaborn will be knowing when dropping down to the matplotlib layer is necessary to achieve a particular customization. On the other hand, users coming from matplotlib will find that much of their knowledge transfers.

Matplotlib has a comprehensive and powerful API; just about any attribute of the figure can be changed to your liking. A combination of seaborn’s high-level interface and matplotlib’s deep customizability will allow you both to quickly explore your data and to create graphics that can be tailored into a publication quality final product.
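
As a small illustration of dropping down to the matplotlib layer, the sketch below draws a seaborn scatter plot and then adjusts it through the matplotlib `Axes` object that the function returns. The data are synthetic and the column names `x` and `y` are made up, so no dataset download is needed.

```python
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(0)

# Synthetic data: y depends linearly on x, plus noise
df = pd.DataFrame({"x": rng.normal(size=100)})
df["y"] = 2 * df["x"] + rng.normal(scale=0.5, size=100)

# Axes-level seaborn functions return the matplotlib Axes they drew on
ax = sns.scatterplot(data=df, x="x", y="y")

# From here on, these are plain matplotlib calls
ax.set_title("Synthetic data")
ax.axhline(0, color="0.5", linestyle="--", linewidth=1)
```

Figure-level functions such as ```relplot()``` return a grid object instead, whose underlying axes can be reached in a similar way (as with `g.ax` in the earlier example).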

#### Next steps

You have a few options for where to go next. You can browse the [example gallery](https://seaborn.pydata.org/examples/index.html) to get a broader sense for what kind of graphics seaborn can produce. Or you can read through the rest of the [user guide and tutorial](https://seaborn.pydata.org/tutorial.html) for a deeper discussion of the different tools and what they are designed to accomplish. If you have a specific plot in mind and want to know how to make it, you could check out the [API reference](https://seaborn.pydata.org/api.html), which documents each function’s parameters and shows many examples to illustrate usage.

### scikit-learn

The purpose of this guide is to illustrate some of the main features that scikit-learn provides. It assumes a very basic working knowledge of machine learning practices (model fitting, predicting, cross-validation, etc.). Please refer to the scikit-learn [installation instructions](https://scikit-learn.org/stable/install.html#installation-instructions) for installing the library.

Scikit-learn is an open source machine learning library that supports supervised and unsupervised learning. It also provides various tools for model fitting, data preprocessing, model selection, model evaluation, and many other utilities.

#### Fitting and predicting: estimator basics

Scikit-learn provides dozens of built-in machine learning algorithms and models, called estimators. Each estimator can be fitted to some data using its `fit` method.

Here is a simple example where we fit a `RandomForestClassifier` to some very basic data:

In [107]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(random_state=0)
X = [[ 1,  2,  3], [11, 12, 13]] # 2 samples, 3 features
y = [0, 1]  # classes of each sample

clf.fit(X, y)

The `fit` method generally accepts 2 inputs:

- The samples matrix (or design matrix) `X`. The size of `X` is typically `(n_samples, n_features)`, which means that samples are represented as rows and features are represented as columns.
- The target values `y`, which are real numbers for regression tasks, or integers for classification (or any other discrete set of values). For unsupervised learning tasks, `y` does not need to be specified. `y` is usually a 1d array where the i-th entry corresponds to the target of the i-th sample (row) of `X`.

Both X and y are usually expected to be numpy arrays or equivalent [array-like](https://scikit-learn.org/stable/glossary.html#term-array-like) data types, though some estimators work with other formats such as sparse matrices.
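
The expected shapes can be checked directly before fitting. The following sketch uses a tiny made-up dataset; `KMeans` is included only as one example of an unsupervised estimator that is fitted without `y`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier

# Made-up data: 4 samples (rows), 3 features (columns)
X = np.array([[1, 2, 3], [11, 12, 13], [2, 3, 4], [12, 13, 14]])
y = np.array([0, 1, 0, 1])  # one target per row of X

print(X.shape)  # (n_samples, n_features)
print(y.shape)  # (n_samples,)

# Supervised estimator: fit takes both X and y
clf = RandomForestClassifier(random_state=0).fit(X, y)

# Unsupervised estimator: fit takes X only
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
```

If `X` has the wrong orientation (features as rows), estimators will usually either raise an error about inconsistent numbers of samples or silently learn from the wrong axis, so checking `X.shape` first is a cheap safeguard.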

Once the estimator is fitted, it can be used for predicting target values of new data. You don’t need to re-train the estimator:

In [108]:
clf.predict(X)  # predict classes of the training data
clf.predict([[4, 5, 6], [14, 15, 16]])  # predict classes of new data

array([0, 1])

#### Transformers and pre-processors

Machine learning workflows are often composed of different parts. A typical pipeline consists of a pre-processing step that transforms or imputes the data, and a final predictor that predicts target values.

In scikit-learn, pre-processors and transformers follow the same API as the estimator objects (they actually all inherit from the same BaseEstimator class). The transformer objects don’t have a `predict` method but rather a `transform` method that outputs a newly transformed sample matrix X:


In [109]:
from sklearn.preprocessing import StandardScaler
X = [[0, 15], [1, -10]]

# scale data according to computed scaling values
StandardScaler().fit(X).transform(X)

array([[-1.,  1.],
       [ 1., -1.]])
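
To see where these values come from, the standardisation can be reproduced by hand: subtract each column's mean and divide by its standard deviation. This is a sketch for intuition rather than a replacement for `StandardScaler`; the variable names `manual` and `scaled` are only illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[0.0, 15.0], [1.0, -10.0]])

# Column-wise mean and population standard deviation (ddof=0)
mean = X.mean(axis=0)
std = X.std(axis=0)

manual = (X - mean) / std
scaled = StandardScaler().fit_transform(X)

print(manual)
print(np.allclose(manual, scaled))
```

After scaling, every column has mean 0 and unit variance, which is why both features end up at -1 and 1 in this two-sample example.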

Sometimes, you want to apply different transformations to different features: the ColumnTransformer is designed for these use-cases.
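
A minimal sketch of that idea follows, assuming a made-up table with one numeric column (`age`) and one categorical column (`city`). The numeric column is standardised and the categorical column is one-hot encoded.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up data: one numeric and one categorical feature
df = pd.DataFrame({
    "age": [22, 35, 58, 41],
    "city": ["Aberdeen", "Glasgow", "Aberdeen", "Dundee"],
})

ct = ColumnTransformer(
    [
        ("num", StandardScaler(), ["age"]),   # scale the numeric column
        ("cat", OneHotEncoder(), ["city"]),   # one-hot encode the categorical column
    ],
    sparse_threshold=0,  # always return a dense array, for easy printing
)

Xt = ct.fit_transform(df)
print(Xt.shape)
print(Xt)
```

The result has one scaled column for `age` followed by one indicator column per distinct city.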

#### Pipelines: chaining pre-processors and estimators

Transformers and estimators (predictors) can be combined together into a single unifying object: a `Pipeline`. The pipeline offers the same API as a regular estimator: it can be fitted and used for prediction with `fit` and `predict`. As we will see later, using a pipeline will also help protect you from data leakage, i.e. disclosing some testing data in your training data.

In the following example, we load the Iris dataset, split it into train and test sets, and compute the accuracy score of a pipeline on the test data:

In [110]:
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# create a pipeline object
pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression()
)

# load the iris dataset and split it into train and test sets
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# fit the whole pipeline
pipe.fit(X_train, y_train)

# we can now use it like any other estimator
accuracy_score(y_test, pipe.predict(X_test))  # argument order is (y_true, y_pred)

0.9736842105263158
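
It can help to see what the pipeline does for us. The sketch below repeats the same steps by hand, fitting the scaler on the training split only and then reusing it on the test split, and checks that the predictions match the pipeline's. The names `scaler`, `model`, and `manual_pred` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Manual version: the scaler only ever sees the training data
scaler = StandardScaler().fit(X_train)
model = LogisticRegression().fit(scaler.transform(X_train), y_train)
manual_pred = model.predict(scaler.transform(X_test))

# Pipeline version: the same two steps wrapped in one estimator
pipe = make_pipeline(StandardScaler(), LogisticRegression()).fit(X_train, y_train)

print(np.array_equal(manual_pred, pipe.predict(X_test)))
```

Fitting the scaler on the full dataset before splitting would let test-set statistics influence training, which is the leakage the pipeline avoids.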

#### Model evaluation

Fitting a model to some data does not entail that it will predict well on unseen data. This needs to be directly evaluated. We have just seen the `train_test_split` helper that splits a dataset into train and test sets, but scikit-learn provides many other tools for model evaluation, in particular for [cross-validation](https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation).

Here we briefly show how to perform a 5-fold cross-validation procedure, using the `cross_validate` helper. Note that it is also possible to manually iterate over the folds, use different data splitting strategies, and use custom scoring functions. Please refer to the scikit-learn [User Guide](https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation) for more details:

In [111]:
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

X, y = make_regression(n_samples=1000, random_state=0)
lr = LinearRegression()

result = cross_validate(lr, X, y)  # defaults to 5-fold CV
result['test_score']  # r_squared score is high because dataset is easy


array([1., 1., 1., 1., 1.])
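
Manually iterating over the folds, as mentioned above, might look like the following sketch. It assumes a plain `KFold` splitter without shuffling and scores each held-out fold with the estimator's default R² score.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

X, y = make_regression(n_samples=1000, random_state=0)

scores = []
for train_idx, test_idx in KFold(n_splits=5).split(X):
    # Fit on four folds, evaluate on the remaining one
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))  # R^2 on the held-out fold

print(np.round(scores, 3))
```

Writing the loop out is mainly useful when you need something the helper does not expose, such as inspecting the fitted model or the predictions for each fold.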

#### Automatic parameter searches

All estimators have parameters (often called hyper-parameters in the literature) that can be tuned. The generalization power of an estimator often critically depends on a few parameters. For example, a [RandomForestRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html#sklearn.ensemble.RandomForestRegressor) has an `n_estimators` parameter that determines the number of trees in the forest, and a `max_depth` parameter that determines the maximum depth of each tree. Quite often, it is not clear what the exact values of these parameters should be since they depend on the data at hand.

Scikit-learn provides tools to automatically find the best parameter combinations (via cross-validation). In the following example, we randomly search over the parameter space of a random forest with a [RandomizedSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html#sklearn.model_selection.RandomizedSearchCV) object. When the search is over, the RandomizedSearchCV behaves as a [RandomForestRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html#sklearn.ensemble.RandomForestRegressor) that has been fitted with the best set of parameters. Read more in the [User Guide](https://scikit-learn.org/stable/modules/grid_search.html#grid-search):

In [112]:
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import train_test_split
from scipy.stats import randint

X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# define the parameter space that will be searched over
param_distributions = {'n_estimators': randint(1, 5), 'max_depth': randint(5, 10)}

# now create a searchCV object and fit it to the data
search = RandomizedSearchCV(estimator=RandomForestRegressor(random_state=0),
    n_iter=5,
    param_distributions=param_distributions,
    random_state=0)

search.fit(X_train, y_train)

search.best_params_

# the search object now acts like a normal random forest estimator
# with max_depth=9 and n_estimators=4
search.score(X_test, y_test)


Note: In practice, you almost always want to [search over a pipeline](https://scikit-learn.org/stable/modules/grid_search.html#composite-grid-search), instead of a single estimator. One of the main reasons is that if you apply a pre-processing step to the whole dataset without using a pipeline, and then perform any kind of cross-validation, you would be breaking the fundamental assumption of independence between training and testing data. Indeed, since you pre-processed the data using the whole dataset, some information about the test sets is available to the train sets. This will lead to over-estimating the generalization power of the estimator (you can read more in this [Kaggle post](https://www.kaggle.com/alexisbcook/data-leakage)).

Using a pipeline for cross-validation and searching will largely keep you from this common pitfall.
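
A hedged sketch of searching over a pipeline is shown below. To avoid any download, it swaps in the small diabetes dataset that ships with scikit-learn (`load_diabetes`) instead of the California housing data; the scaler step is included only to show the mechanics, since random forests do not need scaled inputs.

```python
from scipy.stats import randint
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Bundled dataset, so no internet connection is required
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = make_pipeline(StandardScaler(), RandomForestRegressor(random_state=0))

# make_pipeline names each step after its lowercased class name;
# pipeline parameters are addressed as "<step name>__<parameter>"
param_distributions = {
    "randomforestregressor__n_estimators": randint(1, 5),
    "randomforestregressor__max_depth": randint(5, 10),
}

search = RandomizedSearchCV(
    pipe,
    param_distributions=param_distributions,
    n_iter=5,
    random_state=0,
)
search.fit(X_train, y_train)

print(search.best_params_)
print(search.score(X_test, y_test))
```

Because the scaler sits inside the pipeline, it is re-fitted on the training portion of every cross-validation split, so no statistics from the validation folds leak into training.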

#### Next steps

We have briefly covered estimator fitting and predicting, pre-processing steps, pipelines, cross-validation tools and automatic hyper-parameter searches. This guide should give you an overview of some of the main features of the library, but there is much more to scikit-learn!

Please refer to the scikit-learn [User Guide](https://scikit-learn.org/stable/user_guide.html#user-guide) for details on all the tools that the library provides. You can also find an exhaustive list of the public API in the [API Reference](https://scikit-learn.org/stable/modules/classes.html#api-ref).

You can also look at the numerous [examples](https://scikit-learn.org/stable/auto_examples/index.html#general-examples) that illustrate the use of scikit-learn in many different contexts.

The [tutorials](https://scikit-learn.org/stable/tutorial/index.html#tutorial-menu) also contain additional learning resources.


## Finding Datasets with Kaggle

[Kaggle](https://www.kaggle.com/datasets) is a great resource for finding datasets and examples of data analyses on them. For the last section of the Lab, please visit Kaggle and have a look around some of the datasets that they have available. You can then use these in future labs, to explore, analyse, and visualise datasets that you find interesting.

Some examples to get you started:
- [Titanic](https://www.kaggle.com/datasets/heptapod/titanic) -- A very popular dataset for tutorials and guides, providing information about the passengers on the ill-fated Titanic voyage.
- [Video Game Sales](https://www.kaggle.com/datasets/gregorut/videogamesales) -- Sales data from more than 16,500 games.
- [Students Performance in Exams](https://www.kaggle.com/datasets/spscientist/students-performance-in-exams) -- Marks secured by the students in various subjects.
- [Heart Attack Analysis & Prediction Dataset](https://www.kaggle.com/datasets/rashikrahmanpritom/heart-attack-analysis-prediction-dataset) -- A dataset for heart attack classification.
- [Red Wine Quality](https://www.kaggle.com/datasets/uciml/red-wine-quality-cortez-et-al-2009) -- A simple and clean practice dataset for regression or classification modelling.
- [Customer Personality Analysis](https://www.kaggle.com/datasets/imakash3011/customer-personality-analysis) -- An analysis of a company's ideal customers.
