<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
<channel>
<title>A place to learn and share</title>
<description>Personal page of a data science amateur</description>
<link>http://ferodia.github.io/</link>
<atom:link href="http://ferodia.github.io/feed.xml" rel="self" type="application/rss+xml"/>
<pubDate>Sun, 24 Jul 2016 16:22:34 +0200</pubDate>
<lastBuildDate>Sun, 24 Jul 2016 16:22:34 +0200</lastBuildDate>
<generator>Jekyll v3.1.1</generator>
<item>
<title>Linkedin Data scraping with BeautifulSoup</title>
<description><p>Today I would like to scrape LinkedIn job postings. There are two
ways to go about it:<br />
- Extracting the data from the page source<br />
- Using the LinkedIn API</p>
<p>I chose the first option, mainly because the API is poorly documented and I
wanted to experiment with BeautifulSoup.
In a few words, BeautifulSoup is a library that parses HTML pages and makes it
easy to extract data from them.</p>
<p>Official page: <a href="https://www.crummy.com/software/BeautifulSoup/">BeautifulSoup web page</a></p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="c">## The main packages needed are urllib2 to make URL queries and BeautifulSoup to structure the results</span>
<span class="c">## the imports needed for this experiment</span>
<span class="kn">from</span> <span class="nn">bs4</span> <span class="kn">import</span> <span class="n">BeautifulSoup</span>
<span class="kn">import</span> <span class="nn">urllib2</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="kn">as</span> <span class="nn">pd</span>
<span class="c"># get source code of the page</span>
<span class="k">def</span> <span class="nf">get_url</span><span class="p">(</span><span class="n">url</span><span class="p">):</span>
    <span class="k">return</span> <span class="n">urllib2</span><span class="o">.</span><span class="n">urlopen</span><span class="p">(</span><span class="n">url</span><span class="p">)</span><span class="o">.</span><span class="n">read</span><span class="p">()</span>
<span class="c"># parse the page source into a navigable BeautifulSoup tree</span>
<span class="k">def</span> <span class="nf">beautify</span><span class="p">(</span><span class="n">url</span><span class="p">):</span>
    <span class="n">source</span> <span class="o">=</span> <span class="n">get_url</span><span class="p">(</span><span class="n">url</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">source</span><span class="p">,</span><span class="s">&quot;html.parser&quot;</span><span class="p">)</span></code></pre></figure>
<p>Now that the functions are defined and the libraries are imported, I’ll get the
job postings from LinkedIn.<br />
Inspecting the source code of the page shows where to find the elements we are
interested in.<br />
I did that with the browser’s ‘inspect element’ tool.<br />
I will look for “Data Scientist” postings. Note that I keep the quotes in my
search, because otherwise I get irrelevant postings that merely contain the words
“Data” and “Scientist”.<br />
First, we only need the div element with class ‘results-context’,
which contains a summary of the search, in particular the number of items found.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">jobs</span> <span class="o">=</span> <span class="n">beautify</span><span class="p">(</span><span class="s">&#39;https://www.linkedin.com/jobs/search?keywords=%22Data+Scientist%22&amp;&#39;</span>
<span class="s">&#39;location=France&amp;trk=jobs_jserp_search_button_execute&amp;orig=JSERP&amp;locationId=fr%3A0&#39;</span><span class="p">)</span>
<span class="n">results_context</span> <span class="o">=</span> <span class="n">jobs</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="s">&#39;div&#39;</span><span class="p">,</span> <span class="p">{</span><span class="s">&#39;class&#39;</span> <span class="p">:</span> <span class="s">&#39;results-context&#39;</span><span class="p">})</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="s">&#39;strong&#39;</span><span class="p">)</span>
<span class="n">n_jobs</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="n">results_context</span><span class="o">.</span><span class="n">text</span><span class="o">.</span><span class="n">replace</span><span class="p">(</span><span class="s">&#39;,&#39;</span><span class="p">,</span><span class="s">&#39;&#39;</span><span class="p">))</span>
<span class="k">print</span> <span class="s">&quot;###### Number of job postings #######&quot;</span>
<span class="k">print</span> <span class="n">n_jobs</span>
<span class="k">print</span> <span class="s">&quot;#####################################&quot;</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text"> ###### Number of job postings #######
93
#####################################</code></pre></figure>
<p>Now let’s check the number of postings we got on one page</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">results</span> <span class="o">=</span> <span class="n">jobs</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="s">&#39;li&#39;</span><span class="p">,</span> <span class="p">{</span><span class="s">&#39;class&#39;</span><span class="p">:</span> <span class="s">&#39;job-listing&#39;</span><span class="p">})</span>
<span class="n">n_postings</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">results</span><span class="p">)</span>
<span class="k">print</span> <span class="s">&quot;#### Number of job postings per page ####&quot;</span>
<span class="k">print</span> <span class="n">n_postings</span>
<span class="k">print</span> <span class="s">&quot;#########################################&quot;</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text"> #### Number of job postings per page ####
25
#########################################</code></pre></figure>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">print</span> <span class="s">&quot;#### Number of pages ####&quot;</span>
<span class="n">n_pages</span> <span class="o">=</span> <span class="o">-</span><span class="p">(</span><span class="o">-</span><span class="n">n_jobs</span> <span class="o">//</span> <span class="n">n_postings</span><span class="p">)</span>  <span class="c"># ceiling division: a partial last page still counts</span>
<span class="k">print</span> <span class="n">n_pages</span>
<span class="k">print</span> <span class="s">&quot;#########################&quot;</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text"> #### Number of pages ####
4
#########################</code></pre></figure>
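<p>The page count is really a ceiling division: a partially filled last page still
needs its own request. Here is a minimal sketch in Python 3 (the helper name
page_count is only an illustration, not part of the scraper), including a case
where rounding to the nearest integer would miss a page:</p>

```python
def page_count(n_items, per_page):
    """Number of pages needed to show n_items, per_page at a time."""
    # Ceiling division on integers: negate, floor-divide, negate again.
    return -(-n_items // per_page)

print(page_count(93, 25))   # 4: three full pages plus one partial page
print(page_count(100, 25))  # 4: exactly four full pages
print(page_count(101, 25))  # 5, whereas round(101 / 25.0) gives 4
```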
<p>To extract all postings, I need to iterate over the pages, so I will examine
the URLs of the different pages to work out the logic.</p>
<ul>
<li>
<p>First page: https://www.linkedin.com/jobs/search?keywords=Data+Scientist&amp;locationId=fr:0&amp;start=0&amp;count=25&amp;trk=jobs_jserp_pagination_1</p>
</li>
<li>
<p>Second page: https://www.linkedin.com/jobs/search?keywords=Data+Scientist&amp;locationId=fr:0&amp;start=25&amp;count=25&amp;trk=jobs_jserp_pagination_2</p>
</li>
<li>
<p>Third page: https://www.linkedin.com/jobs/search?keywords=Data+Scientist&amp;locationId=fr:0&amp;start=50&amp;count=25&amp;trk=jobs_jserp_pagination_3</p>
</li>
</ul>
<p>Two elements change from one page to the next:<br />
- start, which is the zero-based page index multiplied by 25 (0, 25, 50)<br />
- trk, whose suffix is the page number (jobs_jserp_pagination_1, then _2, then _3)</p>
<p>I also noticed that the pagination number doesn’t have to change to reach the
next page, which means I can change only the start value to get the next postings
(maybe the LinkedIn developers should do something about it …)</p>
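<p>The pagination logic can be sketched on its own, without fetching anything. This
is a Python 3 illustration using only the standard library; page_url, BASE_URL and
PER_PAGE are names I made up for the example, and the parameters are the ones seen
in the URLs above:</p>

```python
from urllib.parse import urlencode

BASE_URL = "https://www.linkedin.com/jobs/search"  # search endpoint seen in the URLs above
PER_PAGE = 25  # postings per page, as observed above

def page_url(page_index):
    """Build the search URL for a zero-based page index (hypothetical helper)."""
    params = {
        "keywords": '"Data Scientist"',    # quotes kept for an exact-phrase search
        "locationId": "fr:0",
        "start": page_index * PER_PAGE,    # the only value that has to change
        "count": PER_PAGE,
        "trk": "jobs_jserp_pagination_1",  # left constant on purpose
    }
    return BASE_URL + "?" + urlencode(params)

for i in range(3):
    print(page_url(i))
```

<p>urlencode percent-encodes the values, so the quotes become %22 and the colon
becomes %3A, just like in the very first search URL used above.</p>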
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">titles</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">companies</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">locations</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">links</span> <span class="o">=</span> <span class="p">[]</span>
<span class="c"># loop over all pages to get the posting details</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">n_pages</span><span class="p">):</span>
    <span class="c"># define the base url for generic searching</span>
    <span class="n">url</span> <span class="o">=</span> <span class="p">(</span><span class="s">&quot;http://www.linkedin.com/jobs/search?keywords=%22Data+Scientist%22&amp;locationId=&quot;</span>
           <span class="s">&quot;fr:0&amp;start=nPostings&amp;count=25&amp;trk=jobs_jserp_pagination_1&quot;</span><span class="p">)</span>
    <span class="n">url</span> <span class="o">=</span> <span class="n">url</span><span class="o">.</span><span class="n">replace</span><span class="p">(</span><span class="s">&#39;nPostings&#39;</span><span class="p">,</span><span class="nb">str</span><span class="p">(</span><span class="mi">25</span><span class="o">*</span><span class="n">i</span><span class="p">))</span>
    <span class="n">soup</span> <span class="o">=</span> <span class="n">beautify</span><span class="p">(</span><span class="n">url</span><span class="p">)</span>
    <span class="c"># Build lists for each type of information</span>
    <span class="n">results</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="s">&#39;li&#39;</span><span class="p">,</span> <span class="p">{</span><span class="s">&#39;class&#39;</span><span class="p">:</span> <span class="s">&#39;job-listing&#39;</span><span class="p">})</span>
    <span class="n">results</span><span class="o">.</span><span class="n">sort</span><span class="p">()</span>
    <span class="c"># print &quot;there are &quot;, len(results) , &quot; results&quot;</span>
    <span class="k">for</span> <span class="n">res</span> <span class="ow">in</span> <span class="n">results</span><span class="p">:</span>
        <span class="c"># take the text only when the element exists, otherwise store &#39;None&#39;</span>
        <span class="n">titles</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">res</span><span class="o">.</span><span class="n">h2</span><span class="o">.</span><span class="n">a</span><span class="o">.</span><span class="n">span</span><span class="o">.</span><span class="n">get_text</span><span class="p">()</span> <span class="k">if</span> <span class="n">res</span><span class="o">.</span><span class="n">h2</span><span class="o">.</span><span class="n">a</span><span class="o">.</span><span class="n">span</span> <span class="k">else</span> <span class="s">&#39;None&#39;</span><span class="p">)</span>
        <span class="n">companies</span><span class="o">.</span><span class="n">append</span><span class="p">(</span> <span class="n">res</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="s">&#39;span&#39;</span><span class="p">,{</span><span class="s">&#39;class&#39;</span> <span class="p">:</span> <span class="s">&#39;company-name-text&#39;</span><span class="p">})</span><span class="o">.</span><span class="n">get_text</span><span class="p">()</span> <span class="k">if</span>
                          <span class="n">res</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="s">&#39;span&#39;</span><span class="p">,{</span><span class="s">&#39;class&#39;</span> <span class="p">:</span> <span class="s">&#39;company-name-text&#39;</span><span class="p">})</span> <span class="k">else</span> <span class="s">&#39;None&#39;</span><span class="p">)</span>
        <span class="n">locations</span><span class="o">.</span><span class="n">append</span><span class="p">(</span> <span class="n">res</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="s">&#39;span&#39;</span><span class="p">,</span> <span class="p">{</span><span class="s">&#39;class&#39;</span> <span class="p">:</span> <span class="s">&#39;job-location&#39;</span><span class="p">})</span><span class="o">.</span><span class="n">get_text</span><span class="p">()</span> <span class="k">if</span>
                          <span class="n">res</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="s">&#39;span&#39;</span><span class="p">,</span> <span class="p">{</span><span class="s">&#39;class&#39;</span> <span class="p">:</span> <span class="s">&#39;job-location&#39;</span><span class="p">})</span> <span class="k">else</span> <span class="s">&#39;None&#39;</span> <span class="p">)</span>
        <span class="n">links</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">res</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="s">&#39;a&#39;</span><span class="p">,{</span><span class="s">&#39;class&#39;</span> <span class="p">:</span> <span class="s">&#39;job-title-link&#39;</span><span class="p">})</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s">&#39;href&#39;</span><span class="p">)</span> <span class="p">)</span></code></pre></figure>
<p>As mentioned above, working out where each job detail lives in the page is
easy: just view the source code in any browser.</p>
<p>Next, it’s time to create the data frame</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">jobs_linkedin</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">({</span><span class="s">&#39;title&#39;</span> <span class="p">:</span> <span class="n">titles</span><span class="p">,</span> <span class="s">&#39;company&#39;</span><span class="p">:</span> <span class="n">companies</span><span class="p">,</span> <span class="s">&#39;location&#39;</span><span class="p">:</span> <span class="n">locations</span><span class="p">,</span> <span class="s">&#39;link&#39;</span> <span class="p">:</span> <span class="n">links</span><span class="p">})</span></code></pre></figure>
<p>Now the table is filled with the above columns. <br />
Just to verify, I can check the size of the table to make sure I got all the
postings</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">jobs_linkedin</span><span class="o">.</span><span class="n">count</span><span class="p">()</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text"> company 93
link 93
location 93
title 93
dtype: int64</code></pre></figure>
<p>In the end, I got an actual dataset just by scraping web pages. Gathering data
has never been so easy.
I can even go further by parsing the page of each posting and
extracting information like:<br />
- Level<br />
- Description<br />
- Technologies<br />
…</p>
<p>There is no limit to how far we can exploit the information in HTML pages
thanks to BeautifulSoup. You just have to read the documentation, which is very
good by the way, and practice on real pages.</p>
<p>Ciao!</p>
</description>
<pubDate>Sat, 28 May 2016 12:57:54 +0200</pubDate>
<link>http://ferodia.github.io/linkedin-data-scraping-with-beautifulsoup</link>
<guid isPermaLink="true">http://ferodia.github.io/linkedin-data-scraping-with-beautifulsoup</guid>
</item>
<item>
<title>Data analysis made simple</title>
<description><p>While roaming around looking for data to explore, I came across this dataset on the Kaggle website.
The dataset contains information about animals admitted to a shelter, and the goal is to predict their outcome.</p>
<p>But before we get to that let’s explore the files and get to know the features.</p>
<h2 id="warm-up">Warm up</h2>
<p>To read the CSV files we need the readr library:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="kn">library</span><span class="p">(</span><span class="s">&quot;readr&quot;</span><span class="p">)</span></code></pre></figure>
<p><em>If you don’t have the library available you need to install it</em></p>
<p>Now let’s read the files and have a look at what we have.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">animals <span class="o">&lt;-</span> read_csv<span class="p">(</span>file<span class="o">=</span><span class="s">&quot;train.csv&quot;</span><span class="p">)</span>
<span class="kp">head</span><span class="p">(</span>animals<span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## AnimalID Name DateTime OutcomeType OutcomeSubtype
## 1 A671945 Hambone 2014-02-12 18:22:00 Return_to_owner &lt;NA&gt;
## 2 A656520 Emily 2013-10-13 12:44:00 Euthanasia Suffering
## 3 A686464 Pearce 2015-01-31 12:28:00 Adoption Foster
## 4 A683430 &lt;NA&gt; 2014-07-11 19:09:00 Transfer Partner
## 5 A667013 &lt;NA&gt; 2013-11-15 12:52:00 Transfer Partner
## 6 A677334 Elsa 2014-04-25 13:04:00 Transfer Partner
## AnimalType SexuponOutcome AgeuponOutcome
## 1 Dog Neutered Male 1 year
## 2 Cat Spayed Female 1 year
## 3 Dog Neutered Male 2 years
## 4 Cat Intact Male 3 weeks
## 5 Dog Neutered Male 2 years
## 6 Dog Intact Female 1 month
## Breed Color
## 1 Shetland Sheepdog Mix Brown/White
## 2 Domestic Shorthair Mix Cream Tabby
## 3 Pit Bull Mix Blue/White
## 4 Domestic Shorthair Mix Blue Cream
## 5 Lhasa Apso/Miniature Poodle Tan
## 6 Cairn Terrier/Chihuahua Shorthair Black/Tan</code></pre></figure>
<p>We can also check the dimensions of the dataset:<br />
26729 rows, 10 columns.<br />
Next, some columns need to be converted to factors. Since there are many of them, we’ll use the magic of lapply.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">factors <span class="o">&lt;-</span> <span class="kt">c</span><span class="p">(</span><span class="s">&quot;OutcomeType&quot;</span><span class="p">,</span><span class="s">&quot;OutcomeSubtype&quot;</span><span class="p">,</span> <span class="s">&quot;AnimalType&quot;</span><span class="p">,</span><span class="s">&quot;AgeuponOutcome&quot;</span><span class="p">,</span><span class="s">&quot;SexuponOutcome&quot;</span><span class="p">,</span><span class="s">&quot;Breed&quot;</span><span class="p">,</span><span class="s">&quot;Color&quot;</span><span class="p">)</span>
animals<span class="p">[</span>factors<span class="p">]</span> <span class="o">&lt;-</span> <span class="kp">lapply</span><span class="p">(</span>animals<span class="p">[</span>factors<span class="p">],</span>FUN <span class="o">=</span><span class="kp">as.factor</span><span class="p">)</span></code></pre></figure>
<h2 id="know-your-data">Know your data</h2>
<p>I will proceed with some exploration to get to know the kind of information the dataset holds. <br />
The summary function is very useful for checking basic information about the data frame.
It also shows that we have some NA, “Other”, and “Unknown” values, which might be a problem for getting relevant statistical results and machine learning models.</p>
<p>That’s why I will start by observing the data and processing it where needed to make the table more consistent.</p>
<h3 id="age">Age</h3>
<p>The first observation is that the age is expressed in various “units”:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="kp">levels</span><span class="p">(</span>animals<span class="o">$</span>AgeuponOutcome<span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## [1] &quot;0 years&quot; &quot;10 months&quot; &quot;10 years&quot; &quot;11 months&quot; &quot;11 years&quot;
## [6] &quot;12 years&quot; &quot;13 years&quot; &quot;14 years&quot; &quot;15 years&quot; &quot;16 years&quot;
## [11] &quot;17 years&quot; &quot;18 years&quot; &quot;19 years&quot; &quot;1 day&quot; &quot;1 month&quot;
## [16] &quot;1 week&quot; &quot;1 weeks&quot; &quot;1 year&quot; &quot;20 years&quot; &quot;2 days&quot;
## [21] &quot;2 months&quot; &quot;2 weeks&quot; &quot;2 years&quot; &quot;3 days&quot; &quot;3 months&quot;
## [26] &quot;3 weeks&quot; &quot;3 years&quot; &quot;4 days&quot; &quot;4 months&quot; &quot;4 weeks&quot;
## [31] &quot;4 years&quot; &quot;5 days&quot; &quot;5 months&quot; &quot;5 weeks&quot; &quot;5 years&quot;
## [36] &quot;6 days&quot; &quot;6 months&quot; &quot;6 years&quot; &quot;7 months&quot; &quot;7 years&quot;
## [41] &quot;8 months&quot; &quot;8 years&quot; &quot;9 months&quot; &quot;9 years&quot;</code></pre></figure>
<p>I counted 44 distinct values. To be usable, the age should be expressed in a single unit; I chose the smallest one present, which is the day.
For each row, I will convert values given in weeks, months, or years to days, counting a month as 30 days and a year as 365.
First, the function is defined:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="kn">library</span><span class="p">(</span>stringr<span class="p">)</span>
convertAge <span class="o">&lt;-</span> <span class="kr">function</span><span class="p">(</span>age<span class="p">){</span>
  <span class="c1"># first extract the digits</span>
  regexp <span class="o">&lt;-</span> <span class="s">&quot;[[:digit:]]+&quot;</span>
  result <span class="o">&lt;-</span> <span class="m">0</span>
  digits <span class="o">&lt;-</span> <span class="kp">strtoi</span><span class="p">(</span>str_extract<span class="p">(</span>age<span class="p">,</span> regexp<span class="p">))</span>
  <span class="kr">if</span> <span class="p">(</span><span class="kp">grepl</span><span class="p">(</span><span class="s">&quot;day&quot;</span><span class="p">,</span>age<span class="p">)){</span>
    result <span class="o">&lt;-</span> digits
  <span class="p">}</span> <span class="kr">else</span> <span class="kr">if</span><span class="p">(</span><span class="kp">grepl</span><span class="p">(</span><span class="s">&quot;week&quot;</span><span class="p">,</span>age<span class="p">))</span> <span class="p">{</span>
    result <span class="o">&lt;-</span> digits<span class="o">*</span><span class="m">7</span>
  <span class="p">}</span> <span class="kr">else</span> <span class="kr">if</span> <span class="p">(</span><span class="kp">grepl</span><span class="p">(</span><span class="s">&quot;month&quot;</span><span class="p">,</span>age<span class="p">))</span> <span class="p">{</span>
    result <span class="o">&lt;-</span> digits<span class="o">*</span><span class="m">30</span>
  <span class="p">}</span> <span class="kr">else</span> <span class="kr">if</span> <span class="p">(</span><span class="kp">grepl</span><span class="p">(</span><span class="s">&quot;year&quot;</span><span class="p">,</span>age<span class="p">))</span> <span class="p">{</span>
    result <span class="o">&lt;-</span> digits<span class="o">*</span><span class="m">365</span>
  <span class="p">}</span>
  <span class="kr">return</span><span class="p">(</span>result<span class="p">)</span>
<span class="p">}</span></code></pre></figure>
<p>Then I apply the conversion function to each value of the age column:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"> animals<span class="o">$</span>AgeuponOutcome <span class="o">&lt;-</span> <span class="kp">sapply</span><span class="p">(</span>animals<span class="o">$</span>AgeuponOutcome<span class="p">,</span>convertAge<span class="p">)</span></code></pre></figure>
<h3 id="dogs-vs-cats">Dogs vs Cats</h3>
<p>Here we will explore the correlation between the fate of the animal and its type.</p>
<p><img src="/ferodia.github.io/images/markdown_files/figure-markdown_github/unnamed-chunk-7-1.png" alt="center" /></p>
<p>From the plot, it seems that the animal type has some impact on the outcome.
To make sure this is not an artifact of one type dominating the sample, let’s first check the distribution:</p>
<figure class="highlight"><pre><code class="language-text" data-lang="text">##
## Cat Dog
## 0.4165513 0.5834487</code></pre></figure>
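<p>The distribution is not perfectly balanced: dogs represent 58% of the animals.
Still, the outcome clearly depends on the animal type. The table below gives, for each outcome, the share of cats and dogs; for example, dogs make up about 90% of the animals returned to their owner, while cats account for about 75% of those that died:</p>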
<figure class="highlight"><pre><code class="language-text" data-lang="text">##
## Adoption Died Euthanasia Return_to_owner Transfer
## Cat 0.3966942 0.7461929 0.4565916 0.1044714 0.5842709
## Dog 0.6033058 0.2538071 0.5434084 0.8955286 0.4157291</code></pre></figure>
<h3 id="male-vs-female">Male vs Female</h3>
<p>In this part we are interested in the gender, which in this dataset is divided into 4 types:</p>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## [1] &quot;Intact Female&quot; &quot;Intact Male&quot; &quot;Neutered Male&quot; &quot;Spayed Female&quot;
## [5] &quot;Unknown&quot;</code></pre></figure>
<p>plus “Unknown”, which can be any of the other 4.</p>
<p><img src="/ferodia.github.io/images/markdown_files/figure-markdown_github/unnamed-chunk-11-1.png" alt="center" /></p>
<figure class="highlight"><pre><code class="language-text" data-lang="text">##
## Adoption Died Euthanasia Return_to_owner Transfer
## Intact Female 0.01885040 0.28426396 0.25787781 0.062904911 0.2706432
## Intact Male 0.01467174 0.40101523 0.30675241 0.099686520 0.2477181
## Neutered Male 0.48491039 0.09644670 0.22122186 0.469592476 0.2066440
## Spayed Female 0.48156746 0.09137056 0.14919614 0.365308255 0.1736362
## Unknown 0.00000000 0.12690355 0.06495177 0.002507837 0.1013585</code></pre></figure>
<p>The challenge is to fill in the unknowns with the right values: female (spayed or intact) or male (neutered or intact). We will use basic knowledge as well as the other features.</p>
<p>The first thing to try is to infer the gender from the color, using the following golden rule:<br />
<em>For genetic reasons, calico cats, which have three colors (white, orange and black), are almost always female. A calico can be male, but only with a genetic anomaly (XXY chromosomes), and I won’t go that far.</em><br />
Some numbers to support my point:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="kp">table</span><span class="p">(</span>animals<span class="o">$</span>SexuponOutcome<span class="p">[</span>animals<span class="p">[</span><span class="s">&quot;Color&quot;</span><span class="p">]</span> <span class="o">==</span> <span class="s">&quot;Calico&quot;</span><span class="p">])</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">##
## Intact Female Intact Male Neutered Male Spayed Female Unknown
## 198 3 1 296 19</code></pre></figure>
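<p>The counts in this table can be turned into a share with a little arithmetic.
Here is a quick check in Python; the dictionary simply copies the numbers printed
above, and the variable names are only for illustration:</p>

```python
# Counts of Calico animals per gender, copied from the table above.
calico = {
    "Intact Female": 198,
    "Intact Male": 3,
    "Neutered Male": 1,
    "Spayed Female": 296,
    "Unknown": 19,
}

known = sum(n for sex, n in calico.items() if sex != "Unknown")
males = sum(n for sex, n in calico.items() if sex.endswith(" Male"))

print(known)                            # 498 animals of known gender
print(males)                            # 4 males
print(round(100.0 * males / known, 1))  # 0.8 (percent)
```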
<p>When I count the “Calico” cats per gender, only 4 out of almost 500 cats of known gender are male. Therefore, assuming that the 19 unknown ones are female should not harm the statistics.
The main problem that remains is: what kind of female? Intact or spayed?</p>
<p>To start, I can rely on some intuition: animals are born intact and are spayed/neutered at some point in their life, which should not happen before a certain age. For example, a cat that is 1 week old is too young to be spayed; conversely, an old cat is more likely to have been spayed already. Let’s plot age against gender to verify this theory.</p>
<p><img src="/ferodia.github.io/images/markdown_files/figure-markdown_github/unnamed-chunk-13-1.png" alt="center" /></p>
<p>Up to the age of 30 days, there are no neutered/spayed animals, which makes sense from a scientific point of view because the animals are too young. The threshold I will use is 30 days.<br />
The following conclusion is drawn:
<strong>The Calico cats under the age of 30 days are all intact females</strong></p>
<p>Let’s apply it</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">animals<span class="o">$</span>SexuponOutcome<span class="p">[</span>animals<span class="o">$</span>AgeuponOutcome <span class="o">&lt;=</span><span class="m">30</span> <span class="o">&amp;</span> animals<span class="o">$</span>SexuponOutcome <span class="o">==</span> <span class="s">&quot;Unknown&quot;</span><span class="p">]</span> <span class="o">&lt;-</span> <span class="s">&quot;Intact Female&quot;</span></code></pre></figure>
<p>The remaining rows with unknown gender could still be useful for predicting the outcome of the animal:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="kp">table</span><span class="p">(</span>animals<span class="o">$</span>OutcomeType<span class="p">[</span>animals<span class="p">[</span><span class="s">&quot;SexuponOutcome&quot;</span><span class="p">]</span> <span class="o">==</span> <span class="s">&quot;Unknown&quot;</span><span class="p">])</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">##
## Adoption Died Euthanasia Return_to_owner
## 0 8 57 10
## Transfer
## 322</code></pre></figure>
<p>For example, no animal of unknown gender was adopted.</p>
<h2 id="feature-engineering">Feature engineering</h2>
<p>Now let’s move on to creating new features.</p>
<h3 id="hasname">HasName</h3>
<p>To simplify the processing of the name: the characters themselves are not useful for learning and outcome prediction. However, the presence of a name is important. Most of the time it means that the animal belonged to somebody who gave it a name.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">animals<span class="o">$</span>hasName <span class="o">&lt;-</span> <span class="kp">sapply</span><span class="p">(</span>animals<span class="o">$</span>Name<span class="p">,</span>FUN <span class="o">=</span> <span class="kr">function</span><span class="p">(</span>x<span class="p">)</span> <span class="o">!</span><span class="kp">is.na</span><span class="p">(</span>x<span class="p">))</span>
animals<span class="o">$</span>hasName <span class="o">&lt;-</span> <span class="kp">factor</span><span class="p">(</span>animals<span class="o">$</span>hasName<span class="p">)</span></code></pre></figure>
<p>That’s all for today. See you next time!</p>
<p>Ciao!</p>
</description>
<pubDate>Sun, 01 May 2016 19:46:52 +0200</pubDate>
<link>http://ferodia.github.io/blog/2016/Data-Analysis-Animals/</link>
<guid isPermaLink="true">http://ferodia.github.io/blog/2016/Data-Analysis-Animals/</guid>
<category>jekyll</category>
<category>update</category>
</item>
</channel>
</rss>