# TP2 DATA831

By RISS Ryan
IDU4

## Part I: Installation

PySpark + pandas

In [None]:
pip install pyspark

In [None]:
pip install pandas

## Part II: Using PySpark

### Working with .txt files

We first create a Spark configuration and a Spark context.

In [None]:
from pyspark import SparkContext, SparkConf

In [None]:
conf = SparkConf()
# create Spark context with Spark configuration
sc = SparkContext.getOrCreate(conf=conf)

Then we test Spark with a map reduce on the file test.txt.

The content of test.txt is: *Je JE A; A ;*

The goal here is to check how Spark separates the words, and whether the count is sensitive to lowercase/uppercase and to punctuation.

In [None]:

# read in text file and split each line into words
words = sc.textFile("data/test.txt").flatMap(lambda line: line.split(" "))
# count the occurrences of each word
wordCounts = words.map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)

print(wordCounts.collect())

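This map reduce treats a word as any run of characters between spaces, because the line is split with `split(" ")`. Punctuation therefore stays attached to the word (`A;` is distinct from `A`, and `;` counts as a word of its own), and the count is case-sensitive (`Je` and `JE` are counted separately).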
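
The behaviour described above can be reproduced without Spark. The following is a minimal sketch in plain Python, assuming the same line of text as test.txt. The normalization step (lowercasing and keeping only runs of letters) is an illustrative assumption, not part of the original job.

```python
import re
from collections import Counter

# Same content as test.txt.
line = "Je JE A; A ;"

# Tokenization used in the Spark job: split on single spaces only.
raw_counts = Counter(line.split(" "))

# Hypothetical normalization: lowercase, then keep only runs of letters,
# so punctuation is dropped and case no longer matters.
normalized_counts = Counter(re.findall(r"[^\W\d_]+", line.lower()))

print(dict(raw_counts))
print(dict(normalized_counts))
```

With the raw split, every token is distinct. After normalization, `Je`/`JE` and `A;`/`A` collapse into the same keys.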

### Working with CSVs

In [None]:
from pyspark.sql import SparkSession

spSession = SparkSession.builder.appName("Ok").getOrCreate()

We load the CSV, print its schema, then display the first 10 rows.

In [54]:
df = spSession.read.csv("data/epl1.csv", header=True, inferSchema=True, sep = ';')

df.printSchema()

df.show(10)

root
 |-- _c0: integer (nullable = true)
 |-- assists_away_team: integer (nullable = true)
 |-- assists_home_team: integer (nullable = true)
 |-- attendance: integer (nullable = true)
 |-- away_goals: integer (nullable = true)
 |-- away_goals_details: string (nullable = true)
 |-- away_manager: string (nullable = true)
 |-- away_team: string (nullable = true)
 |-- blocks_away_team: integer (nullable = true)
 |-- blocks_home_team: integer (nullable = true)
 |-- clearances_away_team: integer (nullable = true)
 |-- clearances_home_team: integer (nullable = true)
 |-- corners_away_team: integer (nullable = true)
 |-- corners_home_team: integer (nullable = true)
 |-- crosses_away_team: integer (nullable = true)
 |-- crosses_home_team: integer (nullable = true)
 |-- date: string (nullable = true)
 |-- fouls_away_team: integer (nullable = true)
 |-- fouls_home_team: integer (nullable = true)
 |-- free_kicks_away_team: integer (nullable = true)
 |-- free_kicks_home_team: integer (nullable = tr

23/05/04 19:14:30 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
23/05/04 19:14:30 WARN CSVHeaderChecker: CSV header does not conform to the schema.
 Header: , assists_away_team, assists_home_team, attendance, away_goals, away_goals_details, away_manager, away_team, blocks_away_team, blocks_home_team, clearances_away_team, clearances_home_team, corners_away_team, corners_home_team, crosses_away_team, crosses_home_team, date, fouls_away_team, fouls_home_team, free_kicks_away_team, free_kicks_home_team, handballs_away_team, handballs_home_team, home_goals, home_goals_details, home_manager, home_team, offsides_away_team, offsides_home_team, penalties_away_team, penalties_home_team, red_cards_away_team, red_cards_home_team, referee, result, saves_away_team, saves_home_team, season, shots_off_target_away_team, shots_off_target_home_team, shots_on_target_away_team, shots_on_targ

+---+-----------------+-----------------+----------+----------+--------------------+-------------------+--------------+----------------+----------------+--------------------+--------------------+-----------------+-----------------+-----------------+-----------------+-------------------+---------------+---------------+--------------------+--------------------+-------------------+-------------------+----------+--------------------+--------------------+-----------+------------------+------------------+-------------------+-------------------+-------------------+-------------------+----------------+------+---------------+---------------+---------+--------------------------+--------------------------+-------------------------+-------------------------+-------------------+-------------------+---------------------+---------------------+------------------+----------------------+----------------------+
|_c0|assists_away_team|assists_home_team|attendance|away_goals|  away_goals_details|       awa
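
The `CSVHeaderChecker` warning above appears to come from the first header field being empty (the header starts with `, assists_away_team`), which Spark names `_c0`. The following is a minimal sketch, using only the standard library and a made-up three-line sample in the same layout, of giving that column an explicit name. The name `row_id` is a hypothetical choice.

```python
import csv
import io

# Made-up sample imitating the layout of epl1.csv: ';' separator and an
# empty first header field (most likely an exported row index).
sample = ";attendance;away_goals\n0;74363;1\n1;60007;0\n"

reader = csv.reader(io.StringIO(sample), delimiter=";")
header = next(reader)

# Replace the empty header name with an explicit one (hypothetical name).
header = [name if name else "row_id" for name in header]
rows = [dict(zip(header, values)) for values in reader]

print(header)
print(rows[0])
```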

## Part III: Using pandas

We first import pandas and `reduce()`.

In [72]:
import pandas as pd
from functools import reduce

We read the CSV.

In [68]:
df = pd.read_csv("data/epl1.csv", sep=";")

print(df)

     Unnamed: 0  assists_away_team  assists_home_team  attendance  away_goals   
0             0                  0                  3       74363           1  \
1             1                  0                  3       60007           0   
2             2                  1                  0       41494           1   
3             3                  0                  0       36691           0   
4             4                  1                  4       52183           1   
..          ...                ...                ...         ...         ...   
375         375                  1                  0       11155           1   
376         376                  1                  2       39063           2   
377         377                  2                  2       32242           2   
378         378                  3                  1       27036           3   
379         379                  0                  0       75261           0   

                           

### Average number of spectators per match

**Standard method**

In [62]:
df['attendance'].mean()

36451.836842105266

**Map reduce**

In [74]:
def mapper(x):
    return x

def reducer(acc, x):
    return acc + x

total = reduce(reducer, map(mapper, df['attendance']))

print(total/len(df))

36451.836842105266


We get the same result!
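
The map reduce cell above still relies on `len(df)` for the count. As a minimal sketch of a variant, the mapper can emit `(value, 1)` pairs so that the sum and the count come out of the same reduce. The values are the first five attendance figures printed above, and the names `to_pair` and `add_pairs` are illustrative.

```python
from functools import reduce

# First five attendance values from the DataFrame printed above.
attendance = [74363, 60007, 41494, 36691, 52183]

def to_pair(value):
    # Map step: emit (partial sum, partial count).
    return (value, 1)

def add_pairs(acc, pair):
    # Reduce step: add sums and counts component-wise.
    return (acc[0] + pair[0], acc[1] + pair[1])

total, count = reduce(add_pairs, map(to_pair, attendance), (0, 0))
mean = total / count
print(mean)
```

Because the reducer is associative, partial `(sum, count)` pairs computed on separate chunks could be combined the same way.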

### Average number of goals per match

**Standard method**

In [77]:
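# Average of the two per-side means: this is goals per team per match.
# Goals per match (both sides combined) would be the sum without the division by 2.
mean_sum = df['away_goals'].mean() + df['home_goals'].mean()

print("Resultat : " + str(mean_sum/2))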

Resultat : 1.35


**Map reduce**

In [79]:
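# Map reduce mean of each column, then the same average as in the previous cell.
away_mean = reduce(reducer, map(mapper, df['away_goals'])) / len(df)
home_mean = reduce(reducer, map(mapper, df['home_goals'])) / len(df)

print("Resultat : " + str((away_mean + home_mean) / 2))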

Resultat : 1.35
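
Since the mean of `away_goals` plus the mean of `home_goals` equals the mean of their row-wise sum, the value 1.35 is an average per team per match, and the average per match is twice that. The following is a minimal sketch of the distinction, assuming a few hypothetical rows with the same column names (the goal values are made up).

```python
from functools import reduce

# Hypothetical rows using the column names of epl1.csv; goal values are made up.
matches = [
    {"home_goals": 2, "away_goals": 1},
    {"home_goals": 0, "away_goals": 0},
    {"home_goals": 3, "away_goals": 1},
    {"home_goals": 1, "away_goals": 2},
]

def goals_in_match(row):
    # Map step: total goals scored in one match.
    return row["home_goals"] + row["away_goals"]

def add(acc, x):
    # Reduce step: running sum.
    return acc + x

total_goals = reduce(add, map(goals_in_match, matches), 0)
per_match = total_goals / len(matches)
per_team_per_match = per_match / 2

print(per_match, per_team_per_match)
```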


### Number of goals for each team

**Standard method**

In [67]:
print(df.groupby('away_team')['away_goals'].sum() + df.groupby('home_team')['home_goals'].sum())

away_team
Arsenal           65
Aston Villa       27
Bournemouth       45
Chelsea           59
Crystal Palace    39
Everton           59
Leicester         68
Liverpool         63
Man City          71
Man Utd           49
Newcastle         44
Norwich           39
Southampton       59
Spurs             69
Stoke             41
Sunderland        48
Swansea           42
Watford           40
West Brom         34
West Ham          65
dtype: int64


**Map reduce**

In [108]:
def mapperAway(row):
    # Map step: emit (team, goals) for the away side of a match.
    return (row['away_team'], row['away_goals'])

def mapperHome(row):
    # Map step: emit (team, goals) for the home side of a match.
    return (row['home_team'], row['home_goals'])


def reducer2(acc, x):
    # Reduce step: accumulate goals per team in a dict.
    team, goals = x
    acc[team] = acc.get(team, 0) + goals
    return acc

records = df.to_dict('records')
mp1 = reduce(reducer2, map(mapperAway, records), {})  # away goals per team
mp2 = reduce(reducer2, map(mapperHome, records), {})  # home goals per team

# Column 0: away goals, column 1: home goals.
concatSeries = pd.concat([pd.Series(mp1), pd.Series(mp2)], axis=1)

concatSeries['total'] = concatSeries[0]+concatSeries[1]

print(concatSeries)

                 0   1  total
Bournemouth     22  23     45
Aston Villa     13  14     27
Leicester       33  35     68
Norwich         13  26     39
Spurs           34  35     69
Crystal Palace  20  19     39
West Ham        31  34     65
Man City        24  47     71
Sunderland      25  23     48
Liverpool       30  33     63
Chelsea         27  32     59
Watford         20  20     40
Everton         24  35     59
Man Utd         22  27     49
Arsenal         34  31     65
Southampton     20  39     59
Newcastle       12  32     44
West Brom       14  20     34
Stoke           19  22     41
Swansea         22  20     42
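
The per-team totals can also be obtained in a single pass. In the following minimal sketch, each match emits two `(team, goals)` pairs, in the spirit of `flatMap`, and one reduce accumulates them by key. The rows are hypothetical, and the names `team_goal_pairs` and `add_goals` are illustrative.

```python
from functools import reduce
from itertools import chain

# Hypothetical rows using the column names of epl1.csv; values are made up.
matches = [
    {"home_team": "Arsenal", "home_goals": 2, "away_team": "Chelsea", "away_goals": 1},
    {"home_team": "Chelsea", "home_goals": 0, "away_team": "Everton", "away_goals": 3},
    {"home_team": "Everton", "home_goals": 1, "away_team": "Arsenal", "away_goals": 1},
]

def team_goal_pairs(row):
    # flatMap-style mapper: one match yields two (team, goals) pairs.
    return [
        (row["home_team"], row["home_goals"]),
        (row["away_team"], row["away_goals"]),
    ]

def add_goals(acc, pair):
    # Reduce step: accumulate goals per team, home and away together.
    team, goals = pair
    acc[team] = acc.get(team, 0) + goals
    return acc

pairs = chain.from_iterable(map(team_goal_pairs, matches))
totals = reduce(add_goals, pairs, {})

print(totals)
```

This avoids building two dictionaries and concatenating them afterwards, at the cost of losing the separate home and away columns.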
