# Reading data

In `creme`, the features of a sample are stored inside a dictionary, which in Python is called a `dict` and is a native data structure. In other words, we don't use any sophisticated data structure, such as a `numpy.ndarray` or a `pandas.DataFrame`.

The main advantage of using plain `dict`s is that it removes the overhead that comes with using the aforementioned data structures. This is important in a streaming context because we want to be able to process many individual samples in rapid succession. Another advantage is that `dict`s allow us to give names to our features. Finally, `dict`s are not typed, and can therefore store heterogeneous data.

Another advantage which we haven't mentioned is that `dict`s play nicely with Python's standard library. Indeed, Python contains many tools for manipulating `dict`s. For instance, `csv.DictReader` can be used to read a CSV file and convert each row to a `dict`. In fact, the `stream.iter_csv` function from `creme` is just a wrapper on top of `csv.DictReader` that adds a few bells and whistles.
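As a quick illustration of the standard library side of things, here is a minimal sketch of `csv.DictReader` turning CSV rows into `dict`s. The in-memory file and its contents are made up for the example; only `csv.DictReader` and `io.StringIO` come from Python itself.

```python
import csv
import io

# A tiny in-memory CSV file; the column names and values are made up for illustration.
raw = io.StringIO(
    "station,temperature,bikes\n"
    "metro-canal-du-midi,6.54,1\n"
    "place-des-carmes,7.1,4\n"
)

# DictReader yields one mapping per row, keyed by the names found in the header line.
rows = [dict(row) for row in csv.DictReader(raw)]

for row in rows:
    print(row)
```

Note that every value comes back as a string, which is why some extra parsing is usually needed on top.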

`creme` provides some out-of-the-box datasets to get you started:

In [2]:
from creme import stream

stream.available_datasets()

['AirlinePassengers',
 'Bananas',
 'Bikes',
 'ChickWeights',
 'CreditCard',
 'Elec2',
 'HTTP',
 'Higgs',
 'ImageSegments',
 'Insects',
 'MaliciousURL',
 'MovieLens100K',
 'Music',
 'Phishing',
 'Restaurants',
 'SMSSpam',
 'SMTP',
 'SolarFlare',
 'TREC07',
 'Taxis',
 'TrumpApproval']

Each of the above datasets can be loaded with the `stream.iter_dataset` function:

In [3]:
bikes = stream.iter_dataset('Bikes')
bikes

Bikes dataset

              Task  Regression                                                    
 Number of samples  182,470                                                       
Number of features  8                                                             
            Sparse  False                                                         
              Path  /Users/mhalford/creme_data/Bikes/toulouse_bikes.csv           
               URL  https://maxhalford.github.io/files/datasets/toulouse_bikes.zip
              Size  12.52 MB                                                      
        Downloaded  True                                                          

Note that when we say "loaded", we don't mean that the actual data is read from the disk into memory. On the contrary, the dataset is a stream that can be iterated over one sample at a time. In Python lingo, it behaves like a [generator](https://realpython.com/introduction-to-python-generators/).
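If generators are new to you, the following sketch shows the lazy behaviour we are referring to. The `iter_samples` function is a hypothetical stand-in written for this example, it is not part of `creme`.

```python
def iter_samples(n):
    """Yield (features, target) pairs one at a time; nothing is built up front."""
    for i in range(n):
        yield {'index': i, 'square': i * i}, i % 2

# Calling the function returns immediately: no sample has been produced yet.
samples = iter_samples(1_000_000)

# Each call to next() produces exactly one sample.
first_x, first_y = next(samples)
second_x, second_y = next(samples)
print(first_x, first_y)
print(second_x, second_y)
```

Only the samples we actually ask for are ever created, which is what makes it possible to work with datasets that don't fit in memory.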

Let's take a look at the first sample:

In [4]:
x, y = next(iter(bikes))
x

{'moment': datetime.datetime(2016, 4, 1, 0, 0, 7),
 'station': 'metro-canal-du-midi',
 'clouds': 75,
 'description': 'light rain',
 'humidity': 81,
 'pressure': 1017.0,
 'temperature': 6.54,
 'wind': 9.3}

As we can see, the values have different types.

Under the hood, `stream.iter_dataset` simply iterates over a file and parses each value appropriately. We can do this ourselves by using `stream.iter_csv`:

In [5]:
X_y = stream.iter_csv(bikes.path)
x, y = next(X_y)
x, y

({'moment': '2016-04-01 00:00:07',
  'bikes': '1',
  'station': 'metro-canal-du-midi',
  'clouds': '75',
  'description': 'light rain',
  'humidity': '81',
  'pressure': '1017.0',
  'temperature': '6.54',
  'wind': '9.3'},
 None)

There are a couple of things that are wrong. First of all, the numeric features have not been cast into numbers. Indeed, by default, `stream.iter_csv` assumes that everything is a string. A related issue is that the `moment` field hasn't been parsed into a `datetime`. Finally, the target field, which is `bikes`, hasn't been separated from the rest of the features. We can remedy these issues by setting a few parameters:

In [4]:
X_y = stream.iter_csv(
    bikes.path,
    converters={
        'bikes': int,
        'clouds': int,
        'humidity': int,
        'pressure': float,
        'temperature': float,
        'wind': float
    },
    parse_dates={'moment': '%Y-%m-%d %H:%M:%S'},
    target='bikes'
)
x, y = next(X_y)
x, y

({'moment': datetime.datetime(2016, 4, 1, 0, 0, 7),
  'station': 'metro-canal-du-midi',
  'clouds': 75,
  'description': 'light rain',
  'humidity': 81,
  'pressure': 1017.0,
  'temperature': 6.54,
  'wind': 9.3},
 1)

That's much better. We invite you to take a look at the `stream` module to see for yourself what other functions are available. Note that `creme` is first and foremost a machine learning library, and is therefore less concerned with reading data than with statistical algorithms. We do however believe that the fact that we use dictionaries gives you, the user, a lot of freedom and flexibility.
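To make that flexibility concrete, here is a rough sketch of how the same conversions could be done by hand with `csv.DictReader`. The `iter_rows` function is a hypothetical helper written for this example; it mimics the `converters`, `parse_dates` and `target` parameters shown above, but it is not how `stream.iter_csv` is actually implemented. The CSV line is the first row of the bikes dataset as displayed earlier.

```python
import csv
import datetime as dt
import io

def iter_rows(buffer, converters, parse_dates, target):
    """Yield (features, target) pairs from a CSV buffer, casting values on the way."""
    for row in csv.DictReader(buffer):
        x = dict(row)
        # Cast the fields for which a converter was given
        for name, cast in converters.items():
            x[name] = cast(x[name])
        # Parse the date fields using their strptime format
        for name, fmt in parse_dates.items():
            x[name] = dt.datetime.strptime(x[name], fmt)
        # Separate the target from the features
        y = x.pop(target)
        yield x, y

raw = io.StringIO(
    "moment,bikes,station,clouds,description,humidity,pressure,temperature,wind\n"
    "2016-04-01 00:00:07,1,metro-canal-du-midi,75,light rain,81,1017.0,6.54,9.3\n"
)

X_y = iter_rows(
    raw,
    converters={'bikes': int, 'clouds': int, 'humidity': int,
                'pressure': float, 'temperature': float, 'wind': float},
    parse_dates={'moment': '%Y-%m-%d %H:%M:%S'},
    target='bikes'
)
x, y = next(X_y)
print(x, y)
```

Because each sample is a plain `dict`, any source that can produce dictionaries (a database cursor, a message queue, an HTTP request) can feed a model in the same way.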

To conclude, let us briefly mention the difference between *proactive learning* and *reactive learning* in the specific context of online machine learning. When we loop over a dataset with a `for` loop, we have control over the data and the order in which it arrives. We are proactive in the sense that we, the user, are asking for the data to arrive.
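A proactive loop might look like the following sketch. The `RunningMean` class is a toy model written for this example, not a `creme` estimator; it only borrows the `predict_one` and `learn_one` method names, and the three samples are made up.

```python
class RunningMean:
    """Toy online regressor that predicts the mean of the targets seen so far."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def predict_one(self, x):
        return self.mean

    def learn_one(self, x, y):
        # Incremental mean update: no past sample needs to be kept in memory
        self.n += 1
        self.mean += (y - self.mean) / self.n
        return self

# Made-up samples; in practice this would be a stream such as the one built earlier
samples = [
    ({'temperature': 6.5}, 1),
    ({'temperature': 7.0}, 3),
    ({'temperature': 8.2}, 5),
]

model = RunningMean()
errors = []

for x, y in samples:
    y_pred = model.predict_one(x)  # predict before the target is revealed
    errors.append(abs(y - y_pred))
    model.learn_one(x, y)          # then update the model with the true target

print(model.mean, errors)
```

Here we decide when the next sample is pulled from the stream, which is exactly what makes the situation proactive.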

In contrast, in a reactive situation, we don't have control over when the data arrives. A typical example of such a situation is a web server, where web requests arrive in an arbitrary order. This is a situation where `creme` shines. For instance, in a [Flask](https://flask.palletsprojects.com/en/1.1.x/) application, you could define a route to make predictions with a `creme` model like so:

In [5]:
import flask

app = flask.Flask(__name__)

@app.route('/', methods=['GET'])
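def predict():
    # The features are expected as a JSON dict in the request body
    payload = flask.request.json
    # load_model is not defined here; it stands for however you load your model
    creme_model = load_model()
    return creme_model.predict_proba_one(payload)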

Likewise, a model can be updated whenever a request arrives like so:

In [6]:
@app.route('/', methods=['POST'])
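def learn():
    # The JSON body is expected to contain a 'features' dict and a 'target' value
    payload = flask.request.json
    # load_model is not defined here; it stands for however you load your model
    creme_model = load_model()
    creme_model.learn_one(payload['features'], payload['target'])
    return {}, 201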

To summarize, `creme` can be used in many different ways. The fact that it uses dictionaries to represent features provides a lot of flexibility and space for creativity.