# Download only the face emoji from Twitter's GitHub repository

The `pd.read_html` call below requires the libraries `lxml` ("library for XML") and `bs4` ("Beautiful Soup 4"), which you'll have to install separately:

    source activate cshl-sca-2017
    conda install lxml
    pip install bs4

In [13]:
import pandas as pd

data = pd.read_html('http://unicode.org/emoji/charts/full-emoji-list.html')
data

[                    0                                                  1   \
 0     Smileys & People                                                NaN   
 1        face-positive                                                NaN   
 2                    №                                               Code   
 3                    1                                            U+1F600   
 4                    2                                            U+1F601   
 5                    3                                            U+1F602   
 6                    4                                            U+1F923   
 7                    5                                            U+1F603   
 8                    6                                            U+1F604   
 9                    7                                            U+1F605   
 10                   8                                            U+1F606   
 11                   9                                         

`pd.read_html` returns a list of DataFrames, one per table found on the page, so take the first item of `data` and call it `df`.

In [14]:
df = data[0]
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
0,Smileys & People,,,,,,,,,,,,,,,
1,face-positive,,,,,,,,,,,,,,,
2,№,Code,Browser,Appl,Googᵈ,Twtr.,One,FB,FBM,Sams.,Wind.,GMail,SB,DCM,KDDI,CLDR Short Name
3,1,U+1F600,😀,,,,,,,,,,—,—,—,grinning face
4,2,U+1F601,😁,,,,,,,,,,,,,beaming face with smiling eyes


See which entries in column 15 (the "CLDR Short Name" column) contain the word 'face'.

In [15]:
rows = df[15].str.contains('face')
rows.head()

0      NaN
1      NaN
2    False
3     True
4     True
Name: 15, dtype: object

Replace all NaNs with False, because pandas cannot use a mask that contains NaNs for boolean indexing.

In [16]:
rows = rows.fillna(False)
rows.head()

0    False
1    False
2    False
3     True
4     True
Name: 15, dtype: bool
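
The two steps above (search, then fill the NaNs) can also be combined, because `Series.str.contains` accepts an `na` argument that sets the result for missing values. The sketch below uses a small made-up Series named `names` in place of `df[15]`, so the values are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df[15]: section rows have no short name (NaN)
names = pd.Series(['Smileys & People', np.nan, 'grinning face', 'red apple'])

# na=False makes missing values count as "does not contain 'face'",
# so no separate fillna(False) step is needed
mask = names.str.contains('face', na=False)
print(mask)
```

In the notebook this would read `rows = df[15].str.contains('face', na=False)`.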

In [17]:
# Subset the dataframe using the rows with faces
faces = df.loc[rows]
print(faces.shape)
faces.head()

(135, 16)


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
3,1,U+1F600,😀,,,,,,,,,,—,—,—,grinning face
4,2,U+1F601,😁,,,,,,,,,,,,,beaming face with smiling eyes
5,3,U+1F602,😂,,,,,,,,,,,—,,face with tears of joy
7,5,U+1F603,😃,,,,,,,,,,,,,grinning face with big eyes
8,6,U+1F604,😄,,,,,,,,,,,—,—,grinning face with smiling eyes


Split the "U+XXXXX" column (column 1) on the "+"; the code point is the second item of each list (index `1`).

In [18]:
faces[1].str.split('+').head()

3    [U, 1F600]
4    [U, 1F601]
5    [U, 1F602]
7    [U, 1F603]
8    [U, 1F604]
Name: 1, dtype: object

Get the item at index 1 (the hexadecimal digits, not the "U").

In [19]:
faces[1].str.split('+').str[1].head()

3    1F600
4    1F601
5    1F602
7    1F603
8    1F604
Name: 1, dtype: object
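
Taking item 1 after splitting on "+" works for single code points, but some entries in the Unicode chart are sequences of several code points separated by spaces. For those, item 1 is something like `'1F926 U'`, which is the likely source of the `404` and `u.svg` errors in the download log further down. The helper below, `code_to_id`, is a hypothetical sketch that handles both cases. Joining the code points with hyphens is an assumption about how the Twemoji files are named, so check the result against the repository's file list before relying on it.

```python
def code_to_id(code, joiner='-'):
    """Turn 'U+1F600' or 'U+1F926 U+200D U+2642' into a lowercase id.

    The hyphen joiner is an assumption about the Twemoji file naming.
    """
    codepoints = []
    for part in code.split():
        # Drop the leading "U+" from each code point
        if part.upper().startswith('U+'):
            part = part[2:]
        codepoints.append(part.lower())
    return joiner.join(codepoints)


# What the split-on-"+" approach yields for a multi-code-point entry
naive = 'U+1F926 U+200D U+2642'.split('+')[1]
print(repr(naive))

print(code_to_id('U+1F600'))
print(code_to_id('U+1F926 U+200D U+2642'))
```

With a helper like this, the ids could be built with `faces[1].apply(code_to_id)`, which also takes care of the lowercasing.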

In [28]:
faces['ids'] = faces[1].str.split('+').str[1]
faces.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,ids
3,1,U+1F600,😀,,,,,,,,,,—,—,—,grinning face,1F600
4,2,U+1F601,😁,,,,,,,,,,,,,beaming face with smiling eyes,1F601
5,3,U+1F602,😂,,,,,,,,,,,—,,face with tears of joy,1F602
7,5,U+1F603,😃,,,,,,,,,,,,,grinning face with big eyes,1F603
8,6,U+1F604,😄,,,,,,,,,,,—,—,grinning face with smiling eyes,1F604
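
The warning above appears because `faces` was created by slicing `df`, so pandas cannot tell whether the new column is meant to change `df` as well. One common way to avoid it is to take an explicit copy when subsetting. The sketch below uses a tiny made-up DataFrame with the same column labels (`1` and `15`) as a stand-in for the real table.

```python
import pandas as pd

# Hypothetical miniature of the emoji table: column 1 holds the code,
# column 15 holds the short name
df = pd.DataFrame({
    1: ['U+1F600', 'U+1F601', 'U+1F34E'],
    15: ['grinning face', 'beaming face with smiling eyes', 'red apple'],
})

rows = df[15].str.contains('face', na=False)

# .copy() makes faces an independent DataFrame, so adding a column
# to it is unambiguous and leaves df untouched
faces = df.loc[rows].copy()
faces['ids'] = faces[1].str.split('+').str[1].str.lower()
print(faces)
```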


The ids in the Twitter emoji filenames are all lowercase, so convert the letters to lowercase.

In [37]:
faces.loc[:, 'ids'] = faces['ids'].str.lower()
faces.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item_labels[indexer[info_axis]]] = value


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,ids
3,1,U+1F600,😀,,,,,,,,,,—,—,—,grinning face,1f600
4,2,U+1F601,😁,,,,,,,,,,,,,beaming face with smiling eyes,1f601
5,3,U+1F602,😂,,,,,,,,,,,—,,face with tears of joy,1f602
7,5,U+1F603,😃,,,,,,,,,,,,,grinning face with big eyes,1f603
8,6,U+1F604,😄,,,,,,,,,,,—,—,grinning face with smiling eyes,1f604


In [None]:
import os

# exist_ok=True avoids an error if the folder was already created
os.makedirs('faces', exist_ok=True)

In [None]:
cd faces

We want version 2 of the Twitter emoji; see the [file list](https://github.com/twitter/twemoji/tree/gh-pages/2/svg). We can use the Twitter emoji because they are released under an [open source](https://twitter.github.io/twemoji/) license.

Example URL:

    https://raw.githubusercontent.com/twitter/twemoji/gh-pages/2/svg/1f171.svg
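
As a rough alternative to calling `wget` from the shell, the same URL pattern can be built and fetched from Python with the standard library. This is only a sketch: `emoji_url` and `download_emoji` are hypothetical helper names, the base URL is copied from the example above, and failed downloads are collected rather than printed so they are easy to inspect afterwards.

```python
from pathlib import Path
from urllib.error import URLError
from urllib.parse import quote
from urllib.request import urlopen

# Base URL taken from the example above
BASE_URL = 'https://raw.githubusercontent.com/twitter/twemoji/gh-pages/2/svg'


def emoji_url(emoji_id):
    # quote() escapes characters such as spaces so the id stays one URL
    return f'{BASE_URL}/{quote(emoji_id)}.svg'


def download_emoji(emoji_ids, out_dir='.'):
    """Save each emoji as <out_dir>/<id>.svg; return the ids that failed."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    failed = []
    for emoji_id in emoji_ids:
        try:
            with urlopen(emoji_url(emoji_id)) as response:
                (out_dir / f'{emoji_id}.svg').write_bytes(response.read())
        except URLError as error:  # HTTPError (e.g. 404) is a subclass
            failed.append((emoji_id, str(error)))
    return failed


print(emoji_url('1f171'))
```

Calling `failed = download_emoji(faces['ids'])` would then fetch every id and leave a list of the ones that could not be downloaded.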

In [38]:
pwd

'/Users/olgabot/code/cshl-singlecell-2017/notebooks/faces'

In [39]:
for emoji_id in faces['ids']:
    url = f'https://raw.githubusercontent.com/twitter/twemoji/gh-pages/2/svg/{emoji_id}.svg'
    ! wget $url

--2017-06-30 20:24:25--  https://raw.githubusercontent.com/twitter/twemoji/gh-pages/2/svg/1f600.svg
Resolving raw.githubusercontent.com... 151.101.208.133
Connecting to raw.githubusercontent.com|151.101.208.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2067 (2.0K) [text/plain]
Saving to: ‘1f600.svg’


2017-06-30 20:24:26 (11.0 MB/s) - ‘1f600.svg’ saved [2067/2067]

--2017-06-30 20:24:26--  https://raw.githubusercontent.com/twitter/twemoji/gh-pages/2/svg/1f601.svg
Resolving raw.githubusercontent.com... 151.101.208.133
Connecting to raw.githubusercontent.com|151.101.208.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2742 (2.7K) [text/plain]
Saving to: ‘1f601.svg’


2017-06-30 20:24:26 (34.9 MB/s) - ‘1f601.svg’ saved [2742/2742]

--2017-06-30 20:24:26--  https://raw.githubusercontent.com/twitter/twemoji/gh-pages/2/svg/1f602.svg
Resolving raw.githubusercontent.com... 151.101.208.133
Connecting to raw.githubusercontent.com|151.1

HTTP request sent, awaiting response... 200 OK
Length: 2066 (2.0K) [text/plain]
Saving to: ‘1f642.svg’


2017-06-30 20:24:32 (24.9 MB/s) - ‘1f642.svg’ saved [2066/2066]

--2017-06-30 20:24:32--  https://raw.githubusercontent.com/twitter/twemoji/gh-pages/2/svg/1f917.svg
Resolving raw.githubusercontent.com... 151.101.208.133
Connecting to raw.githubusercontent.com|151.101.208.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5031 (4.9K) [text/plain]
Saving to: ‘1f917.svg’


2017-06-30 20:24:32 (61.5 MB/s) - ‘1f917.svg’ saved [5031/5031]

--2017-06-30 20:24:32--  https://raw.githubusercontent.com/twitter/twemoji/gh-pages/2/svg/1f914.svg
Resolving raw.githubusercontent.com... 151.101.208.133
Connecting to raw.githubusercontent.com|151.101.208.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3642 (3.6K) [text/plain]
Saving to: ‘1f914.svg’


2017-06-30 20:24:32 (52.6 MB/s) - ‘1f914.svg’ saved [3642/3642]

--2017-06-30 20:24:32--  https

HTTP request sent, awaiting response... 200 OK
Length: 3177 (3.1K) [text/plain]
Saving to: ‘1f60c.svg’


2017-06-30 20:24:37 (23.9 MB/s) - ‘1f60c.svg’ saved [3177/3177]

--2017-06-30 20:24:37--  https://raw.githubusercontent.com/twitter/twemoji/gh-pages/2/svg/1f61b.svg
Resolving raw.githubusercontent.com... 151.101.208.133
Connecting to raw.githubusercontent.com|151.101.208.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2011 (2.0K) [text/plain]
Saving to: ‘1f61b.svg’


2017-06-30 20:24:38 (15.8 MB/s) - ‘1f61b.svg’ saved [2011/2011]

--2017-06-30 20:24:38--  https://raw.githubusercontent.com/twitter/twemoji/gh-pages/2/svg/1f61c.svg
Resolving raw.githubusercontent.com... 151.101.208.133
Connecting to raw.githubusercontent.com|151.101.208.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2483 (2.4K) [text/plain]
Saving to: ‘1f61c.svg’


2017-06-30 20:24:38 (19.4 MB/s) - ‘1f61c.svg’ saved [2483/2483]

--2017-06-30 20:24:38--  https

HTTP request sent, awaiting response... 200 OK
Length: 4602 (4.5K) [text/plain]
Saving to: ‘1f624.svg’


2017-06-30 20:24:43 (31.3 MB/s) - ‘1f624.svg’ saved [4602/4602]

--2017-06-30 20:24:43--  https://raw.githubusercontent.com/twitter/twemoji/gh-pages/2/svg/1f622.svg
Resolving raw.githubusercontent.com... 151.101.208.133
Connecting to raw.githubusercontent.com|151.101.208.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3012 (2.9K) [text/plain]
Saving to: ‘1f622.svg’


2017-06-30 20:24:44 (31.9 MB/s) - ‘1f622.svg’ saved [3012/3012]

--2017-06-30 20:24:44--  https://raw.githubusercontent.com/twitter/twemoji/gh-pages/2/svg/1f62d.svg
Resolving raw.githubusercontent.com... 151.101.208.133
Connecting to raw.githubusercontent.com|151.101.208.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3534 (3.5K) [text/plain]
Saving to: ‘1f62d.svg’


2017-06-30 20:24:44 (23.9 MB/s) - ‘1f62d.svg’ saved [3534/3534]

--2017-06-30 20:24:44--  https

HTTP request sent, awaiting response... 200 OK
Length: 4867 (4.8K) [text/plain]
Saving to: ‘1f912.svg’


2017-06-30 20:24:49 (29.2 MB/s) - ‘1f912.svg’ saved [4867/4867]

--2017-06-30 20:24:49--  https://raw.githubusercontent.com/twitter/twemoji/gh-pages/2/svg/1f915.svg
Resolving raw.githubusercontent.com... 151.101.20.133
Connecting to raw.githubusercontent.com|151.101.20.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4258 (4.2K) [text/plain]
Saving to: ‘1f915.svg’


2017-06-30 20:24:49 (45.1 MB/s) - ‘1f915.svg’ saved [4258/4258]

--2017-06-30 20:24:50--  https://raw.githubusercontent.com/twitter/twemoji/gh-pages/2/svg/1f922.svg
Resolving raw.githubusercontent.com... 151.101.20.133
Connecting to raw.githubusercontent.com|151.101.20.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3075 (3.0K) [text/plain]
Saving to: ‘1f922.svg’


2017-06-30 20:24:50 (39.6 MB/s) - ‘1f922.svg’ saved [3075/3075]

--2017-06-30 20:24:50--  https://r

--2017-06-30 20:24:55--  https://raw.githubusercontent.com/twitter/twemoji/gh-pages/2/svg/1f639.svg
Resolving raw.githubusercontent.com... 151.101.20.133
Connecting to raw.githubusercontent.com|151.101.20.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5677 (5.5K) [text/plain]
Saving to: ‘1f639.svg’


2017-06-30 20:24:55 (30.2 MB/s) - ‘1f639.svg’ saved [5677/5677]

--2017-06-30 20:24:55--  https://raw.githubusercontent.com/twitter/twemoji/gh-pages/2/svg/1f63b.svg
Resolving raw.githubusercontent.com... 151.101.20.133
Connecting to raw.githubusercontent.com|151.101.20.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4631 (4.5K) [text/plain]
Saving to: ‘1f63b.svg’


2017-06-30 20:24:55 (32.5 MB/s) - ‘1f63b.svg’ saved [4631/4631]

--2017-06-30 20:24:55--  https://raw.githubusercontent.com/twitter/twemoji/gh-pages/2/svg/1f63c.svg
Resolving raw.githubusercontent.com... 151.101.20.133
Connecting to raw.githubusercontent.com|151.101.20

--2017-06-30 20:24:58--  https://raw.githubusercontent.com/twitter/twemoji/gh-pages/2/svg/1f926
Resolving raw.githubusercontent.com... 151.101.20.133
Connecting to raw.githubusercontent.com|151.101.20.133|:443... connected.
HTTP request sent, awaiting response... 404 Not Found
2017-06-30 20:24:58 ERROR 404: Not Found.

--2017-06-30 20:24:58--  http://u.svg/
Resolving u.svg... failed: nodename nor servname provided, or not known.
wget: unable to resolve host address ‘u.svg’
--2017-06-30 20:24:58--  https://raw.githubusercontent.com/twitter/twemoji/gh-pages/2/svg/1f926
Resolving raw.githubusercontent.com... 151.101.20.133
Connecting to raw.githubusercontent.com|151.101.20.133|:443... connected.
HTTP request sent, awaiting response... 404 Not Found
2017-06-30 20:24:58 ERROR 404: Not Found.

--2017-06-30 20:24:58--  http://u.svg/
Resolving u.svg... failed: nodename nor servname provided, or not known.
wget: unable to resolve host address ‘u.svg’
--2017-06-30 20:24:58--  https://raw.githubu

HTTP request sent, awaiting response... 200 OK
Length: 3188 (3.1K) [text/plain]
Saving to: ‘1f437.svg’


2017-06-30 20:25:01 (38.5 MB/s) - ‘1f437.svg’ saved [3188/3188]

--2017-06-30 20:25:01--  https://raw.githubusercontent.com/twitter/twemoji/gh-pages/2/svg/1f42d.svg
Resolving raw.githubusercontent.com... 151.101.20.133
Connecting to raw.githubusercontent.com|151.101.20.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4404 (4.3K) [text/plain]
Saving to: ‘1f42d.svg’


2017-06-30 20:25:01 (32.3 MB/s) - ‘1f42d.svg’ saved [4404/4404]

--2017-06-30 20:25:01--  https://raw.githubusercontent.com/twitter/twemoji/gh-pages/2/svg/1f439.svg
Resolving raw.githubusercontent.com... 151.101.20.133
Connecting to raw.githubusercontent.com|151.101.20.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5628 (5.5K) [text/plain]
Saving to: ‘1f439.svg’


2017-06-30 20:25:02 (39.5 MB/s) - ‘1f439.svg’ saved [5628/5628]

--2017-06-30 20:25:02--  https://r