# Loading data using iterators

It's often useful to load data files such as CSV files in 'chunks' and iterate over each chunk in turn, since a file can be too large to fit in memory.

pandas' `read_csv()` function supports this through the `chunksize` keyword, which sets the number of rows read on each iteration. When `chunksize` is given, `read_csv()` returns an iterable instead of a single `DataFrame`, so we can use a `for` loop to iterate over it; each chunk is a `DataFrame`.

On each iteration we can grab the columns of interest.

In [24]:
import pandas as pd

# Read the file 100 rows at a time; each chunk is a DataFrame
for chunk in pd.read_csv('train.csv', chunksize=100):
    # Show only the columns of interest for this chunk
    print(chunk[['PassengerId', 'Name']])
    print('*' * 80)


    PassengerId                                               Name
0             1                            Braund, Mr. Owen Harris
1             2  Cumings, Mrs. John Bradley (Florence Briggs Th...
2             3                             Heikkinen, Miss. Laina
3             4       Futrelle, Mrs. Jacques Heath (Lily May Peel)
4             5                           Allen, Mr. William Henry
5             6                                   Moran, Mr. James
6             7                            McCarthy, Mr. Timothy J
7             8                     Palsson, Master. Gosta Leonard
8             9  Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)
9            10                Nasser, Mrs. Nicholas (Adele Achem)
10           11                    Sandstrom, Miss. Marguerite Rut
11           12                           Bonnell, Miss. Elizabeth
12           13                     Saundercock, Mr. William Henry
13           14                        Andersson, Mr. Anders J
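
To make the chunking behaviour easier to see, here is a minimal sketch that does not need `train.csv`. It assumes a small made-up CSV held in memory with `io.StringIO` (the `id` and `value` columns are hypothetical, not part of the Titanic data) and shows that the last chunk can be shorter than `chunksize`, and that a column can be aggregated chunk by chunk.

```python
import io

import pandas as pd

# Hypothetical in-memory CSV standing in for a large file on disk
csv_text = "id,value\n" + "\n".join(f"{i},{i * 10}" for i in range(1, 8))

chunk_lengths = []
total = 0

# chunksize=3 yields DataFrames of at most 3 rows each
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=3):
    chunk_lengths.append(len(chunk))
    # Aggregate one column without holding the whole file in memory
    total += int(chunk["value"].sum())

print(chunk_lengths)
print(total)
```

With 7 data rows and a chunk size of 3, the loop runs three times and the final chunk holds the single remaining row.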

## Process Twitter Data

In [25]:
import pandas as pd

counts_dict = {}

# Iterate over the file chunk by chunk
for chunk in pd.read_csv('tweets.csv', chunksize=10):

    # Iterate over the column in DataFrame
    for entry in chunk['lang']:
        if entry in counts_dict:
            counts_dict[entry] += 1
        else:
            counts_dict[entry] = 1

# Print the populated dictionary
print(counts_dict)

{'en': 97, 'et': 1, 'und': 2}


Refactor the above code into a generic function that takes the file name, chunk size, and column name as arguments.

In [26]:
def count_entries(csv_file, c_size, colname):
    """Return a dictionary with counts of
    occurrences as value for each key."""
    
    # Initialize an empty dictionary: counts_dict
    counts_dict = {}

    # Iterate over the file chunk by chunk
    for chunk in pd.read_csv(csv_file, chunksize=c_size):

        # Iterate over the column in DataFrame
        for entry in chunk[colname]:
            if entry in counts_dict:
                counts_dict[entry] += 1
            else:
                counts_dict[entry] = 1

    # Return counts_dict
    return counts_dict

# Call count_entries() and store the result in result_counts
result_counts = count_entries('tweets.csv', 10, 'lang')
print(result_counts)

{'en': 97, 'et': 1, 'und': 2}
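
The inner Python loop can also be replaced by pandas' own counting. The following is a sketch of one possible variant, not the version used above: `count_entries_vectorized` is a hypothetical name, and the sample data is a made-up in-memory CSV, not `tweets.csv`. It uses `usecols` so that only the needed column is parsed, and `value_counts()` to count within each chunk. Note that `value_counts()` skips missing values by default, so results may differ from the loop version if the column contains them.

```python
import io
from collections import Counter

import pandas as pd


def count_entries_vectorized(csv_file, c_size, colname):
    """Return a dict mapping each value in colname to its count."""
    counts = Counter()
    # usecols limits parsing to the one column we need
    for chunk in pd.read_csv(csv_file, chunksize=c_size, usecols=[colname]):
        # value_counts() returns a Series of value -> count for this chunk
        counts.update(chunk[colname].value_counts().to_dict())
    return dict(counts)


# Hypothetical sample data for illustration only
sample = io.StringIO("lang,text\nen,a\nen,b\net,c\nund,d\nen,e\n")
sample_counts = count_entries_vectorized(sample, 2, "lang")
print(sample_counts)
```

`Counter.update()` adds the per-chunk counts to the running totals, so the function only ever holds one chunk plus the dictionary of counts in memory.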